The present disclosure is directed to a distributed edge secure storage network utilizing redundant heterogeneous storage. In one embodiment, N storage nodes that are coupled via a network are selected to store a file of size |F| and redundancy of size |Fred|. At least two of the N storage nodes allocate different sizes of memory for storing the file. The N storage nodes are ordered from a largest storage capacity at the first storage node to a smallest capacity |sN| at the Nth storage node. A value Z<N is selected such that an attacker having access to Z storage nodes is unable to decode any partial information about the file. The file is divided into d partitions of size |Ps
These and other features and aspects of various embodiments may be understood in view of the following detailed discussion and accompanying drawings.
The discussion below makes reference to the following figures, wherein the same reference number may be used to identify the similar/same component in multiple figures.
The present disclosure generally relates to distributed data storage systems. Due to, among other things, the widespread adoption of mobile devices and the “Internet of things” (IoT), data is being generated exponentially. It is estimated by one source that data creation will grow to an enormous 163 zettabytes by 2025, which is ten times the amount of data created in 2017. This stored data can include large amounts of automatically generated data, e.g., data generated by sensors. The sensor data may include the raw data captured by end devices as well as the data generated as the result of analyzing the raw data. One solution for storing and analyzing large amounts of data involves transferring it to large data centers, which is commonly referred to as cloud storage.
Assuming network traffic grows exponentially, it may become increasingly difficult to send all of the created data to the cloud for storage, especially for time-critical applications. In some emerging technologies, such as smart cities and autonomous cars, the data may need to be analyzed and stored in real time, which may be difficult to do in the cloud. Among other things, cloud computing may be affected by relatively high latency (e.g., the cloud storage facility may be located a long distance from where the data is generated) as well as by the unpredictability of network connections (e.g., due to spikes in demand, outages, etc.).
An alternative to analyzing dynamically generated sensor data in the cloud is using distributed edge storage, where a large portion of data F is divided into partitions and each partition is stored in an available edge device close to the data source. In
Distributed edge storage provides data availability closer to where it is needed and reduces delay. For example, a user terminal 110 in proximity to the edge layer 106 may actively monitor the sensors 104 and/or storage nodes 102 looking for patterns and/or events that occur in near real-time. The devices of the edge layer 106 may also send some of the data to the cloud storage, e.g., for archival purposes, and the terminal 110 may also access data there in situations that are not sensitive to time delays.
Storing data on edge devices can create a more responsive system, but may also put data security at risk. For example, the edge devices may have limited capabilities (e.g., computation, memory) and therefore may not be able to implement multiple layers of security without unduly limiting performance. The edge devices may also not be under the control of a single entity, which can make enforcing security policies difficult. This disclosure describes a security scheme that addresses security challenges that are specific to edge devices, but that may be applicable to other storage systems, e.g., centralized or decentralized distributed storage systems.
For a distributed edge storage setup, one appropriate attack model is the case where a number of edge devices are compromised. More specifically, an eavesdropping attack is a scenario in which the attacker (eavesdropper) controls a group of edge devices and spies on the data stored in them. The goal is to keep data confidential from the compromised devices that are used as distributed storage nodes. An eavesdropping attack scenario according to an example embodiment is shown in the block diagram of
Data 200 (e.g., a file F) is stored in a distributed manner among a group of edge storage nodes 202, which may be accessible via a network. An end user may desire to store the file F via a network connected computer device, such as user terminal 110 shown in
In this example, a subset 204 of the edge storage nodes 202 can be accessed by the attacker 206 such that the attacker 206 can at least view the data of interest stored on the subset 204. For purposes of this disclosure, the value Z signifies the maximum number of nodes to which the attacker 206 has access. The system is designed such that the attacker 206 cannot read any partial information about the data file 200 with access to only Z nodes. An authorized user will have access to more than Z of the nodes and therefore can read the data file 200. In some embodiments, the authorized user will need access to all of the edge nodes 202 to read the data file 200, and in other embodiments the authorized user may be able to read the data file 200 with fewer than all of the nodes 202, but more than Z nodes.
Secret sharing schemes using linear coded keys address eavesdropping attacks: data is divided into shares of equal size, and each share is masked with linear coded keys and stored in one of the available storage nodes. For instance, assume there are M=4 available storage nodes {s1, s2, s3, s4}. The file F is first divided into two equal shares, f1 and f2, and keys k1 and k2 are generated. Then, the four packets of Ps
The edge devices 202 are heterogeneous with different memory, compute, bandwidth, power, etc. Direct application of the existing secret sharing schemes may yield poor performance for distributed edge storage as they do not take into account the heterogeneity of storage nodes. For example, if the four storage nodes s1, s2, s3, and s4 have different allocated storage availability, then the stored packets of s1, s2, s3, and s4 should have different sizes. For purposes of this disclosure, the term “storage availability” is used to describe the capability of a storage node to fulfill a request that meets some minimum requirement. The storage node should not only have the available capacity to store the data but should have performance when writing to and reading from the storage (which includes network communications) that satisfies a minimum standard defined for the system. In
In
This disclosure covers, among other things (i) how to select storage nodes among all candidate storage nodes, (ii) how to partition file F, (iii) how to generate the keys, and (iv) how to create packets to be stored in the selected storage nodes. In
Based on the definition of the partitions from file partitioner 404, a processing module 413 generates linear coded partitions, which are referred to as hi's. A key management section 412 includes a key generation module 414 that generates a key set for the file 400 and a combination module 416 that linearly combines the keys of the set into linear coded keys, gi's. A packet generation module 418 uses the definition 410 and the linear coded keys to generate and store the file 400 on the network 406. A similar set of modules may be used to read data from the network 406: based on the definition 410, the modules download the partitions/packets, remove the key masking from the partitions/packets, and reassemble the file.
In this system model, there are M heterogeneous edge devices 406 that can be used as distributed storage nodes. The set of all candidate storage nodes is denoted by ={s1, s2, . . . , sM}. First, a subset 408 of all M available storage nodes are selected to be used for storing data F, securely. The subset 408 of selected storage nodes is denoted by ={s1, s2, . . . , sN}, where N≤M. Then, the set of packets 410={ps
It is assumed that the system is vulnerable to an attack, where the capability of the attacker is characterized in terms of parameter Z<N. More specifically, the attacker 420 can access the data stored in at most Z storage nodes (e.g., two nodes s1 and s5 as shown in the example of
The data in the file 400 is stored such that the attacker 420 cannot get any meaningful information about data. More specifically, the proposed solution provides information theoretic secrecy defined as H(F|)=H(F), where H(.) is the information theory entropy and is the data stored in any storage set ⊂, such that ||=Z. One of the applications of information theoretic secrecy is where a linear combination of the data partitions can reveal some meaningful information about the whole data. In the proposed method, one goal is to keep any linear combination of the data partitions confidential from any subset of storage nodes with size Z.
Another goal in designing the distributed storage system is to add redundancy such that data F can be retrieved by having access to the data stored in t devices, where Z<t≤N. The reason behind this consideration is that the edge devices are mobile and the encounter time of the authorized user with the storage nodes may vary over time, e.g., the storage nodes may be offline from time to time. In addition, edge devices are not designed for enterprise purposes and thus their tolerance threshold against failure might be low. Therefore, the goal is to provide the ability to retrieve data by having access to fewer than all N storage nodes in case some storage nodes become unavailable due to mobility or failure.
As shown in the flowchart of
Storage Selection
In order to use the available resources efficiently, the minimum required resources for creating and storing keys are determined such that the privacy conditions are met. All the remaining available resources are utilized to add redundancy such that the designed system can be more robust to edge failure/loss. The minimum requirement to satisfy information-theoretic privacy when storing a file partition f in a storage node is to mask it with a random key that has the same size as f, e.g., f+k, where |k|=|f|. In addition, to keep data confidential from any Z storage nodes, the packets stored in any Z storage nodes should be linearly independent.
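The masking requirement above can be illustrated with a minimal sketch. It assumes the addition f+k is carried out bytewise over GF(2), i.e., as XOR; the disclosure does not fix the field, and the function names `mask` and `make_key` are hypothetical.

```python
import secrets

def make_key(size: int) -> bytes:
    """Draw a uniformly random key of the requested size in bytes."""
    return secrets.token_bytes(size)

def mask(partition: bytes, key: bytes) -> bytes:
    """Compute f + k over GF(2) (bytewise XOR); requires |k| = |f|."""
    if len(key) != len(partition):
        raise ValueError("key must be the same size as the partition")
    return bytes(a ^ b for a, b in zip(partition, key))

# Illustrative data only; any byte string works as a partition.
f = b"sensor-reading-0042"
k = make_key(len(f))
packet = mask(f, k)          # what a storage node would hold
recovered = mask(packet, k)  # XOR is its own inverse
```

Because the key is uniformly random and as long as the partition, the stored packet alone is statistically independent of f under this assumption.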
For this linear independence constraint to be satisfied, one requirement is that for any stored packet Ps
In order to add redundancy to the system, more storage will be allocated, e.g., by adding more storage nodes. Thus, for selecting storage nodes the requirement of Σ_{i=Z+1}^{N} |s_i| > |F|+|Fred| should be satisfied, where |Fred| is the estimated desired redundancy. For example, |Fred| can be set as |F|. As explained next, the storage system improves this estimation by taking into account the system parameter t, the number of storage nodes that an authorized user should have access to in order to retrieve data F. Therefore, in the first step, the N storage nodes are selected such that Σ_{i=Z+1}^{N} |s_i| > 2.5|F| is satisfied. In the next step, the size of each packet to be stored in each storage node si and subsequently the value of parameter t that can be achieved with this set of storage nodes are determined.
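A minimal sketch of the first selection step follows. It assumes a greedy, largest-first choice among the candidate nodes, which is one possible reading and not necessarily the selection rule of the disclosure; `select_storage_nodes` and its parameters are hypothetical names.

```python
from typing import List, Optional

def select_storage_nodes(capacities: List[float], z: int,
                         file_size: float, redundancy: float) -> Optional[List[float]]:
    """Return the capacities of the selected nodes, ordered largest first.

    Nodes are added until the capacity of nodes Z+1..N exceeds
    |F| + |Fred|. Returns None if the candidates cannot satisfy it.
    """
    ordered = sorted(capacities, reverse=True)
    for n in range(z + 1, len(ordered) + 1):
        # Capacity of the nodes that will hold masked file blocks.
        if sum(ordered[z:n]) > file_size + redundancy:
            return ordered[:n]
    return None

# Illustrative capacities in arbitrary units; |Fred| set equal to |F|.
selected = select_storage_nodes([3, 5, 1, 5, 4, 5], z=2, file_size=6, redundancy=6)
```

With these illustrative numbers, five nodes give a capacity of exactly 12 over nodes 3 to 5, which does not strictly exceed 12, so a sixth node is added.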
In order to provide privacy, the size of the key required to mask file F is restricted by the maximum packet size stored over all N storage nodes. To minimize the size of the key and thus provide an opportunity to add more redundancy to the system, it is desired to use the full capacity of storage nodes with smaller storage sizes rather than using all available memory in storage nodes with larger storage sizes. On the other hand, in order to decrease the complexity, it is desired to partition data F into larger parts, so that the number of file partitions is smaller. For this purpose, the storage system uses the maximum storage size from the allocated size of sN for storing packet Ps
For the last N−Z storage nodes, each Ps
The amount of information that can be obtained by having access to t selected storage nodes among all N storage nodes is shown in Equation (2), where, tj is the number of storage nodes that are selected from the set among all selected t storage nodes. Note that in Equation (2), n1=nZ+1.
info=|Ps
The probability that file F can be retrieved by having access to t random storage nodes among all N storage nodes is equal to the probability that the amount of information obtained from any t random storage nodes is greater than or equal to |F|. It can be proved that this probability is calculated using Equation (3) below, where d=|F|/|Ps
In
First, the size of each packet to be stored in each selected storage node is determined. The maximum storage size from the allocated size of sN for packet Ps
Next, parameter t is determined, which is the number of nodes that an authorized user should have access to in order to be able to retrieve the whole data F with a threshold probability of 60%. In other words, t is the minimum number of storage nodes whose stored packets an authorized user should be able to access in order to retrieve data F with probability 60%. Using the formulation in Equation (3), the minimum t for the given threshold of 60% is equal to t=4. Note that the 60% threshold probability is a predefined system requirement, and could be set to other values.
With this threshold probability, the probability that an authorized user can retrieve data F by selecting t random storage nodes out of N storage nodes is greater than or equal to 60%. However, the user can quickly check the number of blocks at each storage node and determine whether it can retrieve the data. If it cannot retrieve the data, it can select another set of t storage nodes randomly; this increases the probability of success in retrieving the data to 1−(1−0.6)^2 = 0.84 = 84%, which is significant. The probability of being able to retrieve data F over multiple rounds can also be the criterion for determining the threshold probability for pr(info≥|F|).
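The multi-round figure above follows from treating each random selection of t nodes as an independent trial, as the 84% calculation does. The short sketch below reproduces that arithmetic; the function name is hypothetical.

```python
def success_probability(p_single: float, rounds: int) -> float:
    """Probability that at least one of `rounds` independent selections succeeds."""
    return 1.0 - (1.0 - p_single) ** rounds

two_rounds = success_probability(0.6, 2)    # the 84% case from the text
three_rounds = success_probability(0.6, 3)
```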
File Partition and Constructing hi's
The first Z storage nodes are allocated to store the keys only, whereas the remaining N−Z storage nodes store the file partitions masked with keys. More specifically, each storage si, Z<i≤N, stores ni blocks each with size |Ps
The total number of blocks stored in the last N−Z storage nodes is equal to Σ_{i=Z+1}^{N} n_i. In order to construct the first part of these blocks, first file F is divided into d=|F|/|Ps
Note that the complexity of constructing hi's increases with the number of file partitions, d, because the matrix and vector that are multiplied to create hi's become larger. Therefore, it is desired to select smaller values for parameter d, which results in dividing file F into a smaller number of partitions.
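The dependence on d can be seen in a minimal sketch of the matrix-vector product. It makes several assumptions that are not stated in the text: arithmetic is over a small prime field (p = 257), each partition is a single field symbol, every row is a plain Vandermonde row with evaluation point α_l = l, and the names are hypothetical.

```python
P = 257  # small prime modulus, chosen only for illustration

def vandermonde_row(alpha: int, width: int, p: int = P) -> list:
    """Row [1, alpha, alpha^2, ..., alpha^(width-1)] mod p."""
    return [pow(alpha, j, p) for j in range(width)]

def coded_partitions(parts: list, count: int, p: int = P) -> list:
    """Return `count` linear combinations h_1..h_count of the d partitions."""
    if count >= p:
        raise ValueError("need distinct nonzero evaluation points below p")
    d = len(parts)
    result = []
    for l in range(1, count + 1):
        row = vandermonde_row(l, d, p)  # d coefficients per combination
        result.append(sum(c * f for c, f in zip(row, parts)) % p)
    return result

h = coded_partitions([3, 1, 4], count=5)  # d = 3 partitions, 5 coded blocks
```

Each hi costs d multiplications in this sketch, so halving d halves the work per block.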
Applying these partitioning concepts to the example arrangement shown in
Key Generation and Constructing gi's
As mentioned in the previous section, the first Z storage nodes store keys only. Each remaining storage si, Z+1≤i≤N, stores ni blocks each with size |Ps
The minimum required number of key blocks to keep data confidential from any Z storage nodes is restricted by the number of blocks required for the largest stored packet. More specifically, Z·n_{Z+1} is the minimum required number of key blocks, where the size of each block is |Ps
Each of the first Z storage nodes si, 1≤i≤Z, stores n_i = n_{Z+1} key blocks. More specifically, as will be explained in section “Packet Generation”, the set of key blocks {k_j | (i−1)·n_1+1 ≤ j ≤ i·n_1} (where n_1 is equal to n_{Z+1}) will be stored in storage node si, 1≤i≤Z. The second parts of blocks stored in the remaining N−Z storage nodes, denoted by gi's, are constructed sequentially from the first blocks for all storage nodes to the last blocks for the corresponding storage nodes. The number of blocks for the last N−Z storage nodes varies between 1 (for the Nth storage node) and n_1 (for the (Z+1)st storage node).
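The index set above can be written out directly. The sketch below uses the values Z=2 and n_1=5, under which the first two storage nodes hold k1 to k5 and k6 to k10; the function name is hypothetical.

```python
def key_block_indices(i: int, n1: int) -> list:
    """Indices j of the key blocks k_j stored in key node s_i (1 <= i <= Z)."""
    return list(range((i - 1) * n1 + 1, i * n1 + 1))

s1_keys = key_block_indices(1, 5)
s2_keys = key_block_indices(2, 5)
```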
Note that the lth blocks should be constructed such that any subset of these blocks with size Z is a linearly independent set. For instance, to construct the lth blocks (1≤l≤n_1), the Vandermonde matrix with Z columns and as many rows as required can be used. In this way, any subset of Z packets is linearly independent. One other requirement for constructing gi's is that the probability of retrieving data F by having access to t random edge storage nodes is maximized. To satisfy these two requirements, first, {g_1, g_2, . . . , g_{N−Z}} are constructed to be used in the 1st blocks stored in the last N−Z storage nodes. To construct these gi's, N−Z independent linear combinations of the first blocks of keys stored in the first Z storage nodes are created as defined in Equation (7).
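The independence property of Vandermonde rows can be checked numerically. The sketch below assumes arithmetic over a small prime field (p = 257) with distinct evaluation points 1 to 4; the field, the points, and the helper names are illustrative assumptions rather than the construction of Equation (7). It requires Python 3.8 or later for the modular inverse via `pow`.

```python
from itertools import combinations

P = 257  # small prime modulus, chosen only for illustration

def det_mod_p(matrix, p=P):
    """Determinant of a square matrix mod p via Gaussian elimination."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] % p), None)
        if pivot is None:
            return 0  # singular: no nonzero pivot in this column
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)  # modular inverse of the pivot
        for r in range(col + 1, n):
            factor = m[r][col] * inv % p
            for c in range(col, n):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p

def any_z_rows_independent(rows, z, p=P):
    """True if every choice of z rows forms a nonsingular z-by-z matrix."""
    return all(det_mod_p([list(r) for r in subset], p) != 0
               for subset in combinations(rows, z))

Z = 2
rows = [[pow(a, j, P) for j in range(Z)] for a in range(1, 5)]  # N - Z = 4 rows
ok = any_z_rows_independent(rows, Z)
```

Repeating an evaluation point breaks the property, which is why the points must be distinct.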
The number of storage nodes requiring the 2nd block for their stored packet is equal to N−Z−N1, where N1 is the number of storage nodes containing only one block. {gN−Z+1, gN−Z+2, . . . , g2(N−Z)−N
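Similarly, {g_l | N−Z+Σ_{m=1}^{i−2}(N−Z−Σ_{j=1}^{m} N_j)+1 ≤ l ≤ N−Z+Σ_{m=1}^{i−1}(N−Z−Σ_{j=1}^{m} N_j)} are used in the ith blocks of packets stored in the storage nodes {s_l | Z+1 ≤ l ≤ N−Σ_{j=1}^{i−1} N_j}, where N_j is the number of storage nodes containing exactly j blocks. These blocks are constructed as shown in Equation (9).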
Applying this to the previous example of
The key combinations {g1, g2, g3, g4} are used in the first blocks of storage nodes s3, s4, s5, s6 and are constructed as linear combinations of k1, k6 (the keys used in the 1st blocks of storage nodes s1, s2). The key combinations {g5, g6, g7} are used in the 2nd blocks of storage nodes s3, s4, s5 and are constructed as linear combinations of k2, k7 (the keys used in the 2nd blocks of storage nodes s1, s2). The key combinations {g8, g9, g10} are used in the 3rd blocks of storage nodes s3, s4, s5 and are constructed as linear combinations of k3, k8 (the keys used in the 3rd blocks of storage nodes s1, s2). The key combinations {g11,g12} are used in the 4th blocks of storage nodes s3, s4 and are constructed as linear combinations of k4, k9 (the keys used in the 4th blocks of storage nodes s1, s2). The key combination {g13} is used in the 5th block of storage node s3 and is constructed as the linear combination of k5, k10 (the keys used in the 5th blocks of storage nodes s1,s2).
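The assignment just listed can be reproduced mechanically from the block counts of the last N−Z storage nodes (5, 4, 3, and 1 blocks for s3 to s6 in this example). The sketch below is only a bookkeeping aid with hypothetical function names; it does not compute the combinations themselves.

```python
def assign_key_combinations(block_counts, z):
    """Map block level -> {node index: g index}.

    block_counts[i] is the number of blocks in node s_(z+1+i),
    listed in non-increasing order.
    """
    assignment = {}
    next_g = 1
    for level in range(1, max(block_counts) + 1):
        # Nodes that still need a block at this level.
        nodes = [z + 1 + i for i, n in enumerate(block_counts) if n >= level]
        assignment[level] = {node: next_g + j for j, node in enumerate(nodes)}
        next_g += len(nodes)
    return assignment

def keys_for_level(level, z, n1):
    """Indices of the keys held in the level-th blocks of the first z nodes."""
    return [(i - 1) * n1 + level for i in range(1, z + 1)]

plan = assign_key_combinations([5, 4, 3, 1], z=2)
```

For instance, `plan[2]` gives g5, g6, g7 for s3, s4, s5, and `keys_for_level(2, 2, 5)` gives the indices of k2 and k7.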
Note that any other selection of keys for constructing gi's will either (i) deliver less info from the data stored in a random selection of t storage nodes, where Z+1≤t<N, or (ii) break the privacy conditions required to protect data from any Z storage nodes. The key combinations gi, 1≤i≤13, are constructed using part of the Vandermonde matrix created for constructing hi's as shown in
Packet Generation
The packet stored in each storage node includes two parts. The first part is a function of the constructed hi's generated from the file partitions and the second part is a function of the constructed gi's generated from keys. These parts are combined to generate the packets to be stored in the N storage nodes. The first Z storage nodes are assigned to store keys only. As explained before, the allocated storage sizes for all these storage nodes are the same and equal to |Ps
The packets stored in the last N−Z storage nodes contain file partitions masked with keys. There are Σ_{i=Z+1}^{N} n_i blocks in total in the last N−Z storage nodes, and there are Σ_{i=Z+1}^{N} n_i generated hi's in (6) that play the role of the first part of the packets, the file partitions. Note that some of these file partitions are redundant; they are used to make the system resilient to the loss/failure of N−t storage nodes. Each block is masked with the minimum number of keys required to provide privacy, i.e., Z, because unnecessarily adding more keys would require an authorized user to access more blocks in order to subtract the key part and extract the file part from a packet. In other words, an authorized user should be able to get the maximum info calculated in Equation (2) by having access to any t storage nodes. The details on the design of packets to achieve this objective are provided below.
All jth blocks of the storage nodes requiring at least j blocks are masked with the same set of Z keys. In this way, an authorized user can get the most info by having access to heterogeneous storage nodes. For this purpose, the data stored in the first blocks of Ps
with |PsN|be equal to the size of the file partitions fi in bits.
Returning to the previous example, the packets stored in the first two storage nodes s1, s2 store keys only (see
The methods described above are scalable and adaptive to increasing the size of data F. This is applicable in real-time applications, where the size of data F is constantly increasing over time. These methods are adaptive to adding more storage nodes to the system, which is applicable in dynamic edge environments, where new edge devices may join the network. The methods are also adaptive to increasing the size of a storage node. This is also applicable in dynamic edge environments, where more memory usage in an edge device may become available. The following paragraphs explain how the distributed storage system can be extended to be adaptive and scalable.
More information can be added to the stored file F, once enough amount of data with size equal to the size of one file partition, |Ps
In the next step, the only modification that is required is to add α_{l−d}^{d}·f_{d+1} to each h_l, ∀(d+1)≤l<Σ_{i=Z+1}^{N} n_i, e.g., h_l = h_l + α_{l−d}^{d}·f_{d+1}, where α_{l−d}^{d} corresponds to adding one more column to the Vandermonde matrix created in Equation (5). The corresponding blocks using h_l, ∀(d+1)≤l<Σ_{i=Z+1}^{N} n_i, are thus updated by adding α_{l−d}^{d}·f_{d+1}. Note that as d is now updated as d=d+1, the system parameter t, the number of storage nodes that an authorized user should contact to retrieve data F, should be updated using Equation (3).
Note that here the goal is to modify the minimum number of blocks with the minimum required complexity for modifying those blocks. That is why the added file partition is used to replace the last previously generated block. Note that, this strategy does not comply with the strategy to minimize the complexity of extracting data for an authorized user. Therefore, if the priority is to minimize the complexity of extracting data for an authorized user, then Equation (6) could be used to recalculate all elements hl, ∀1≤l≤Σi=Z+1Nni by using the updated vector of file partitions and then regenerate all packets using the updated hi's, which requires higher computational complexity for regenerating packets.
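The incremental update and the full recomputation discussed above give the same coded value for an affected block, which a short sketch can confirm. It assumes arithmetic over a small prime field (p = 257) and single-symbol partitions; these choices and the function names are illustrative, not taken from Equations (5) and (6).

```python
P = 257  # small prime modulus, chosen only for illustration

def combine(alpha: int, parts: list, p: int = P) -> int:
    """Full computation: sum over j of alpha^j * parts[j], mod p."""
    return sum(pow(alpha, j, p) * f for j, f in enumerate(parts)) % p

def append_partition(h: int, alpha: int, d: int, new_part: int, p: int = P) -> int:
    """Incremental update: add the new Vandermonde column term alpha^d * f_(d+1)."""
    return (h + pow(alpha, d, p) * new_part) % p

parts = [3, 1, 4]                 # d = 3 existing partitions
h_old = combine(2, parts)         # existing coded block with alpha = 2
h_new = append_partition(h_old, 2, len(parts), 5)   # f_(d+1) = 5
h_full = combine(2, parts + [5])  # recomputed from scratch
```

The incremental path touches each affected block once with a single multiply-add, whereas recomputation costs d+1 multiplications per block.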
A new storage node can be added to the system by first determining the number of blocks it can store, then adding a new row to the Vandermonde matrix V used in Equations (6) and (9), and finally constructing as many hi's and gi's as the determined number of blocks. Note that the added storage node may increase or decrease the probability of retrieving data F for an authorized user contacting t random storage nodes for a given t. Therefore, the updated pr(info>d) should be calculated using the updated parameters according to Equation (3).
If more memory becomes available in an edge device used as one of the distributed storage nodes, more blocks can be added to the packet stored in that storage node under certain conditions. The condition for adding one more block to the storage node si already containing ni blocks is that there are at least Z other storage nodes that contain at least ni+1 blocks (or can be modified to contain ni+1 blocks). If this condition is satisfied, the new block is added as follows.
For adding one more block to one of the first Z storage nodes, a new key should be created and stored as the last block in the stored packet. Note that this key will be used to add one more block to one of the other storage nodes containing file partitions. For adding one more block to one of the other storage nodes, i.e., si, i>Z, a new hl is constructed by creating a new row of the Vandermonde matrix in (5). Then, a new gl is constructed by multiplying the corresponding row of the Vandermonde matrix with the key vector containing the keys of the (ni+1)st blocks of the first Z storage nodes. In the last step, the constructed hl and gl are summed and the new block containing hl+gl is added as the last block of Ps
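The eligibility condition for growing a node can be expressed as a simple check. The sketch below considers only the current block counts and ignores the parenthetical case of nodes that could themselves be modified to hold more blocks; the function name and the 0-based list indexing are assumptions made for illustration.

```python
def can_add_block(block_counts: list, node: int, z: int) -> bool:
    """True if at least z other nodes already hold n_i + 1 or more blocks.

    block_counts[j] is the current number of blocks in node s_(j+1);
    `node` is the 0-based position of the node to grow.
    """
    needed = block_counts[node] + 1
    others = sum(1 for j, n in enumerate(block_counts)
                 if j != node and n >= needed)
    return others >= z

# Block counts for s1..s6 in the running example (s1, s2 hold keys only).
counts = [5, 5, 5, 4, 3, 1]
s6_ok = can_add_block(counts, node=5, z=2)   # s6 holds 1 block
s3_ok = can_add_block(counts, node=2, z=2)   # s3 already holds the maximum
```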
In this document, a framework based on linear coded keys is described. This framework is resilient to the failures or losses of N−t edge devices used as distributed storage nodes and provides information theoretic secrecy for any Z storage nodes, e.g., no adversary having access to the data stored in any z≤Z storage nodes can get any meaningful information about F, which is stored in a distributed manner among the N available edge devices. It can be proved that this system is optimal in terms of creating the minimum required number of keys and delivering the maximum information to an authorized user contacting any t>Z storage nodes.
In
The instructions 1508 are operable to cause the processor 1502 to select N storage nodes 1514a that are coupled via the network 1512 to store file 1516 of size |F| and redundancy of size |Fred|. At least two of the N storage nodes 1514a allocate different sizes of memory for storing the file 1516. The N storage nodes 1514a are ordered from a largest storage capacity at the first storage node to a smallest capacity |sN| at the Nth storage node. The processor 1502 selects a value Z<N, such that an attacker having access to Z storage nodes is unable to decode the file 1516. The file 1516 is divided into d partitions of size |Ps
The instructions 1508 are operable to cause the processor 1502 to create independent linear combinations hi's of the d partitions of the file 1516. The processor 1502 generates keys that are stored in the first Z of the N storage nodes 1514a and creates independent linear combinations gi's of the generated keys. The processor 1502 stores combinations of the hi's and gi's in the (Z+1)st to Nth storage nodes. The processor 1502 of this or another similar device may be operable to read the file from t of the N storage nodes 1514a, where t<N. The t storage nodes may be randomly selected, and the processor 1502 can determine if the file can be read from the t storage nodes. If not, a second set of t storage nodes can be randomly selected, and the file is then read from the second set. The sets can be iteratively selected and re-read over more than two rounds until the file is successfully read.
The various embodiments described above may be implemented using circuitry, firmware, and/or software modules that interact to provide particular results. One of skill in the arts can readily implement such described functionality, either at a modular level or as a whole, using knowledge generally known in the art. For example, the flowcharts and control diagrams illustrated herein may be used to create computer-readable instructions/code for execution by a processor. Such instructions may be stored on a non-transitory computer-readable medium and transferred to the processor for execution as is known in the art. The structures and procedures shown above are only a representative example of embodiments that can be used to provide the functions described hereinabove.
The foregoing description of the example embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the embodiments to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. Any or all features of the disclosed embodiments can be applied individually or in any combination and are not meant to be limiting, but purely illustrative. It is intended that the scope of the invention be limited not with this detailed description, but rather determined by the claims appended hereto.