The present disclosure is directed to a distributed secure storage network utilizing a cost function to allocate storage. In one embodiment, a method involves selecting a value Z such that an attacker having access to Z storage nodes is unable to decode any partial information of a file of size |F| stored in a network that distributedly stores the file in more than the Z storage nodes. A number N* of the storage nodes is selected that minimizes a cost function that includes |F|, Z, an initial data access cost CT, and a transmission and downloading cost Cd. Equal memory sizes are initially allocated from the N* of the storage nodes to store the file and a set of linear code keys. A first cost of adding more storage nodes to the N* storage nodes and a second cost of allocating more memory from a subset of the N* storage nodes are iteratively determined. Based on a minimal cost determined from the iterative determinations, the file and linear code keys are stored in N≥N* of storage nodes, individual keys stored in a first through Zth storage nodes and independent linear combinations of the keys and partitions of the file stored in a Z+1th to Nth storage node.
In another embodiment, N edge storage nodes that are coupled via a network are selected to store a file of size |F|. The N edge storage nodes have non-homogeneous storage availability and are ordered from a largest storage availability at the first edge storage node to a smallest availability at the Nth edge storage node. A value Z<N is selected, such that an attacker having access to Z edge storage nodes is unable to decode any partial information of the file. The first through Z+1th edge storage nodes have a same assigned packet size. Keys are stored in the first Z edge storage nodes and independent linear combinations of the keys combined with partitions of the file are stored in the Z+1th to the Nth edge storage nodes.
These and other features and aspects of various embodiments may be understood in view of the following detailed discussion and accompanying drawings.
The discussion below makes reference to the following figures, wherein the same reference number may be used to identify the similar/same component in multiple figures.
The present disclosure generally relates to distributed data storage systems. Due to, among other things, the widespread adoption of mobile devices and the “Internet of things” (IoT), data is being generated exponentially. It is estimated by one source that data creation will grow to an enormous 163 zettabytes by 2025, which is ten times the amount of data created in 2017. This stored data can include large amounts of automatically generated data, e.g., data generated by sensors. The sensor data may include the raw data captured by end devices as well as the data generated as the result of analyzing the raw data. One solution for storing and analyzing large amounts of data involves transferring it to large data centers, which is commonly referred to as cloud storage.
Assuming network traffic grows exponentially, it may become increasingly difficult to send all of the created data to the cloud for storage, especially for time-critical applications. In some emerging technologies, such as smart cities and autonomous cars, the data may need to be analyzed and stored in real time, which may be difficult to do in the cloud. Among other things, cloud computing may be affected by relatively high latency (e.g., the cloud storage facility may be located a long distance from where the data is generated) as well as by unpredictable network connections (e.g., due to spikes in demand, outages, etc.).
An alternative to analyzing dynamically generated sensor data in the cloud is using distributed edge storage, where a large portion of data F is divided into partitions and each partition is stored in an available edge device close to the data source. In
Distributed edge storage provides data availability closer to where it is needed and reduces delay. For example, a user terminal 110 in proximity to the edge layer 106 may actively monitor the sensors 104 and/or storage nodes 102 looking for patterns and/or events that occur in near real-time. However, storing data on edge devices can put data security at risk. For example, the edge devices may have limited capabilities (e.g., computation, memory) and therefore may not be able to implement multiple layers of security without unduly limiting performance. The edge devices may also not be under the control of a single entity, which can make enforcing security policies difficult. This disclosure describes a security scheme that addresses security challenges that are specific to edge devices, but that may be applicable to other devices, e.g., centralized distributed storage systems.
For distributed edge storage setup, one appropriate attack model is the case where a number of edge devices are compromised. More specifically, an eavesdropping attack is a scenario in which the attacker (eavesdropper) controls a group of edge devices and spies on the data stored in them. The goal is to keep data confidential from the devices under attack used as distributed storage nodes. An eavesdropping attack scenario according to an example embodiment is shown in the block diagram of
Data 200 (e.g., a file F) is stored in a distributed manner among a group of edge storage nodes 202, which may be accessible via a network. An end user may desire to store the file F via a network connected computer device, such as user terminal 110 shown in
In this example, a subset 204 of the edge storage nodes 202 can be accessed by the attacker 206 such that the attacker 206 can view the data of interest stored on the subset 204. For purposes of this disclosure, the value Z signifies the maximum number of nodes to which the attacker 206 has access. The system is designed such that the attacker 206 cannot read any partial information of the data file 200 with access to only Z nodes. An authorized user will have access to more than Z of the nodes and therefore can read the data file 200. In some embodiments, the authorized user will need access to all of the edge nodes 202 to read the data file 200.
Secret sharing schemes using linear coded keys address eavesdropping attacks, where data is divided into shares with equal sizes and each share is masked with linear coded keys and stored in one of the available storage nodes. For instance, assume there are M=4 available storage nodes of ={s1, s2, s3, s4}. Data F is first divided into two equal shares of f1 and f2, and keys k1 and k2 are generated. Then, the four packets of Ps
The edge devices 202 are heterogeneous with different memory, compute, bandwidth, power, etc. Direct application of the existing secret sharing schemes yields poor performance for distributed edge storage as they do not take into account the heterogeneity of storage nodes. For example, if the four storage nodes s1, s2, s3, and s4 have different allocated storage availability, then the stored packets of Ps
In
In order for an entity to utilize heterogeneous edge storage nodes for secure storage, the entity may define, among other things, (i) how to select storage nodes among all candidates, (ii) how to partition the file, (iii) how to generate the keys, and (iv) how to create packets to be stored in the selected storage nodes. These issues are addressed in this disclosure, as well as how the storage allocation can be optimized for cost.
In
A key generation section 412 includes a module 414 that generates a key set for the file 400 and a module 416 that linearly combines the keys of the set into linear coded keys, gi's. A packet generation module 418 uses the definition 410 and the linear coded keys to generate and store the file 400 on the network 406. A similar set of modules may be used to read data from the network 406, based on the definition 410, download the partitions/packets, unencrypt the partitions/packets, and reassemble the file.
Each packet is created using the file partitions and the generated keys. The file partitioner 404 uses the available heterogeneous resources efficiently such that the designed distributed edge storage system is secure against an eavesdropper adversary 420 attacking at most Z storage nodes. Consider a system model where there are M heterogeneous edge devices that can be used as distributed storage nodes. The set of all candidate storage nodes in network 406 is denoted by ={s1, s2, . . . , sM}. A subset 408 of all M available storage nodes are selected to be used for securely storing data F.
The set of selected storage nodes is denoted by ={s1, s2, . . . sN}, where N≤M. Then, the set of packets ={Ps
From the defender's point of view, a more robust system with larger values of Z comes at the cost of an increase in storage usage (an increase in the number of distributed storage nodes) and an increase in the complexity of designing the secure system. In other words, parameter Z can be considered a tradeoff between providing more security and increasing the complexity of the system, taking into account how vulnerable the system is to an attack.
One goal is to store the data such that the attacker cannot get any meaningful information about data. More specifically, the proposed solution provides information theoretic secrecy defined as H(F|)=H(F), where H(.) is the information theory entropy and is the data stored in any storage set ⊂ with size Z, (||=Z). One of the applications of information theoretic secrecy is where a linear combination of the data partitions can reveal some meaningful information about the whole data set. In the proposed method, any linear combination of the data partitions are kept confidential from any subset of storage nodes of quantity Z. An authorized user can extract the data F by having access to all of the packets stored in the storage set .
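The secrecy condition above can be checked by brute force on a toy layout. The sketch below is illustrative only: the four-packet layout (two key packets and two masked partitions), the field GF(5), and the function names are assumptions chosen to keep the enumeration small, not the layout defined in this disclosure. It counts how often each (observation, file) pair occurs when the keys are uniform; if every file value is equally likely for every observation, the observed nodes reveal nothing about the file.

```python
from collections import Counter
from itertools import combinations, product

P = 5  # small prime field, assumed only to keep the enumeration cheap


def packets(f1, f2, k1, k2):
    """Hypothetical layout with Z = 2: two key packets, two masked partitions."""
    return (
        k1 % P,
        k2 % P,
        (f1 + k1 + k2) % P,
        (f2 + k1 + 2 * k2) % P,
    )


def leaks_nothing(nodes):
    """True if the packets held by `nodes` are independent of the file (f1, f2)."""
    counts = Counter()
    for f1, f2, k1, k2 in product(range(P), repeat=4):
        stored = packets(f1, f2, k1, k2)
        observed = tuple(stored[i] for i in nodes)
        counts[(observed, (f1, f2))] += 1
    files = list(product(range(P), repeat=2))
    observations = {obs for obs, _ in counts}
    # With uniform keys, independence means each observation is compatible
    # with every file value equally often.
    return all(
        len({counts[(obs, f)] for f in files}) == 1 for obs in observations
    )


if __name__ == "__main__":
    for pair in combinations(range(4), 2):
        print(pair, leaks_nothing(pair))
    print((0, 1, 2), leaks_nothing((0, 1, 2)))
```

In this toy layout every pair of nodes passes the check, while the three nodes holding k1, k2, and f1+k1+k2 do not, since f1 can then be computed directly.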
In this disclosure, features are described that assist in reducing computational complexity at the design stage as well as reducing computational complexity and communication cost for an authorized user at the stage of retrieving data. In reference again to
Storage Selection
In order to provide security, the file partitions are masked with keys and thus some of the available memory resources should be allocated to store keys. The minimum requirement to satisfy the information theory privacy for storing a file partition f in a storage node is to mask it with a random key that has the same size as f, e.g., f+k, where |k|=|f|. In addition, to keep data confidential from any Z storage nodes, the packets stored in any Z storage nodes should be linearly independent. For this constraint to be satisfied, one requirement is that for any stored packet Ps
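As a minimal sketch of the masking requirement |k|=|f|, the following assumes that addition is carried out over a binary extension field, so that f+k reduces to a bytewise XOR. The function name `mask` and the sample data are hypothetical.

```python
import secrets


def mask(partition: bytes, key: bytes) -> bytes:
    """Return partition + key, with addition assumed to be bytewise XOR."""
    if len(key) != len(partition):
        raise ValueError("key must be the same size as the partition")
    return bytes(a ^ b for a, b in zip(partition, key))


f = b"sensor reading 42"              # a file partition (illustrative data)
k = secrets.token_bytes(len(f))       # random key with |k| = |f|
stored = mask(f, k)                   # what a storage node would hold
recovered = mask(stored, k)           # XOR with the same key removes the mask
```

Because XOR is its own inverse, the same function both applies and removes the mask, which mirrors the authorized user subtracting the key parts during retrieval.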
Therefore, the first requirement for the storage selection is $\sum_{i=Z+1}^{N}|s_i| = |F|$, where |F| is the size of file F. The other requirement is that the total allocated storage size over all storage nodes should be equal to $\sum_{i=1}^{N}|s_i| = \sum_{i=1}^{Z}|s_{Z+1}| + \sum_{i=Z+1}^{N}|s_i| = Z|s_{Z+1}| + |F|$. This means that for allocating data F, distributedly, $Z|s_{Z+1}|$ extra memory is required that is used to store keys and keep data secure. However, for an authorized user to be able to retrieve data F, it should have access to the data stored in all storage nodes and download them, then it should subtract the key parts and extract the useful information F.
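The storage accounting above can be illustrated with a short calculation. This is a sketch under the stated relation (total storage equals Z|sZ+1| plus |F|); the function name and the sample sizes are illustrative assumptions.

```python
def storage_budget(data_packet_sizes, z):
    """Return (file size, key overhead, total) for the allocation.

    data_packet_sizes: memory used on nodes s_{Z+1}..s_N, largest first.
    z: number of nodes the attacker may access (also the number of key nodes).
    """
    file_size = sum(data_packet_sizes)          # sum over i = Z+1..N of |s_i|
    key_overhead = z * data_packet_sizes[0]     # Z * |s_{Z+1}|
    return file_size, key_overhead, file_size + key_overhead


# Two data nodes using 8 MB each, protected against Z = 2 compromised nodes.
print(storage_budget([8, 8], z=2))  # (16, 16, 32)
```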
The cost for the authorized user to retrieve data from N storage nodes is a function of number of storage nodes as well as other parameters such as type of storage, bandwidth, power, etc., which are different for different storage nodes. The set of selected storage nodes should be chosen to minimize the cost. Even if the storage nodes are homogeneous in terms of type of storage, bandwidth, and power, but heterogeneous only in terms of available storage size, minimizing cost is still not trivial. In the following, an optimized set of selected storage nodes is found by focusing on optimizing the cost for this simplified scenario using Equation (1) below:
cost=NCT+(Z|Ps
As seen in Equation (1), cost includes two parts: (i) NCT, where CT is the initial cost for accessing the data stored at each storage node and NCT is the total cost over all storage nodes, and (ii) (Z|Ps
The goal is to select the set of storage nodes such that the cost defined in Equation (1) is minimized. To tackle this problem, the effect of |Ps
As seen in Equation (2), by increasing N, the first part of the lower bound increases and the second part decreases. Therefore, there is an optimum value for N that minimizes the lower bound, which can be calculated as shown in Equation (3) below, where N* is rounded to the optimized integer. Specifically, N1 is N* rounded to the nearest larger integer and N2 is N* rounded to the nearest smaller integer.
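The rounding step can be sketched as follows. Because the equations themselves are not reproduced in this text, the code assumes a lower bound of the form N·CT + (Z|F|/(N−Z) + |F|)·Cd, which follows from the storage total Z|sZ+1|+|F| when all N−Z data packets have the equal size |F|/(N−Z). Under that assumption the real-valued minimizer is Z + sqrt(Z|F|Cd/CT). The function names and the cost values are hypothetical.

```python
import math


def cost_lower_bound(n, file_size, z, c_t, c_d):
    """Assumed bound: N*C_T + (Z*|F|/(N-Z) + |F|)*C_d, defined for N > Z."""
    return n * c_t + (z * file_size / (n - z) + file_size) * c_d


def optimal_node_count(file_size, z, c_t, c_d):
    """Round the real-valued minimizer down and up, keep the cheaper integer."""
    n_real = z + math.sqrt(z * file_size * c_d / c_t)
    n_down = max(z + 1, math.floor(n_real))   # N2 in the text
    n_up = max(z + 1, math.ceil(n_real))      # N1 in the text
    return min(
        (n_down, n_up),
        key=lambda n: cost_lower_bound(n, file_size, z, c_t, c_d),
    )


# Illustrative values only: |F| = 24, Z = 2, C_T = 12, C_d = 1.
print(optimal_node_count(24, 2, 12, 1))  # 4
```

The first term grows linearly in N while the second shrinks as 1/(N−Z), so the assumed bound is convex for N > Z and checking the two neighboring integers is sufficient.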
The cost with the optimized calculated number of storage nodes N* is larger than the calculated lower bound in Equation (2), when the edge devices are heterogeneous. However, this clue can be used to select the set for heterogeneous distributed edge storage. The strategy is to initially use equal memory sizes from the first N* storage nodes (the storage nodes with the largest available storage sizes) and if more memory is required for storing F, then iteratively decide to either (i) add more storage nodes or (ii) use more memory from each available storage. The decision between these two options is made based on minimizing the cost. The details are provided below and generally illustrated in
An initialization 500 of the storage selection involves selecting the first N* storage nodes as ={s1, s2, . . . , sN*} after ordering the nodes in the descending available storage sizes, where, N=N*. The maximum size of file F that can be equally and distributedly stored in these storage nodes is (N−Z)|sN|. Therefore, if it is determined 501 the size of file F is larger than this, then the “Modification” stage is entered, where either 510 more of storage nodes are added or 512 more memory from the available storage nodes −N is used. If the size of the file F is less than what is available in the first N* storage nodes, the set of initially selected storage nodes is finalized 502 as ={s1, s2, . . . , sN*} with N=N*, where the size of each packet stored in each storage si is equal to |Ps
The Modification stage involves determining which of these options 510, 512 has the minimum cost and selecting that option. For the Modification stage, info=(N−Z−n)(|sN−n|−|sN−n+1|) more information can be added to the distributed storage system by (i) increasing 503 the size of each packet stored in the first N−n storage nodes by |sN−n|−|sN−n+1|, with the update |Ps
Note that if info=(N−Z−n)(|sN−n|−|sN−n+1|) is equal to 0, the first option 503 is not available and thus the second option 504 is selected by adding as many additional storage nodes as required such that info=|F|−Σi=Z+1N|Ps
After the Modification stage, which involves modifying the size of memory used from each storage or the set of selected storage nodes, then parameters should be updated. If first option 503 is selected in the Modification stage, update 505 involves updating the parameter |Ps
After making the required updates, if the condition 507 Σi=Z+1N|Ps
Next, the set of selected storage nodes is determined. The parameter n is initialized as n=1. Initialization involves first selecting the N=4 edge devices with the largest available storage sizes as the set of selected storage nodes, ={s1, s2, s3, s4}, where |s1|=18 MB, |s2|=15 MB, |s3|=10 MB, |s4|=8 MB. Because s1 and s2 are reserved for key storage, the maximum size of file F that can be equally and distributedly stored in these storage nodes is (N−Z)|sN|=16 MB, where the packet sizes stored in these storage nodes are |Ps
For the Modification, info=(N−Z−n)(|sN−1|−|sN|)=2 MB more information can be added to the distributed storage system by (i) increasing the size of each packet stored in the first 3 storage nodes by 2 MB as seen in
For the second round of modification, info=(N−Z−n)(|sN−1|−|sN|) with the updated parameters is equal to 0 and thus the first option is not available. Therefore, the second option is selected, which adds as many additional storage nodes as required such that the sum of |si| over all added storage nodes is equal to the additional required info, i.e., info=|F|−Σi=Z+1N|Ps
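The numbers in this example can be reproduced with the two expressions given earlier: the initial capacity (N−Z)|sN| and the increment info=(N−Z−n)(|sN−n|−|sN−n+1|). The sketch below uses the node sizes from the example. It assumes that n is advanced from 1 to 2 after the first option is applied, since the update rule is not fully reproduced in this text; the function names are hypothetical.

```python
def initial_capacity(sizes, n_nodes, z):
    """Largest file storable with equal packets: (N - Z) * |s_N|."""
    return (n_nodes - z) * sizes[n_nodes - 1]


def extra_info(sizes, n_nodes, z, n):
    """info = (N - Z - n) * (|s_{N-n}| - |s_{N-n+1}|), with 1-based node indices."""
    return (n_nodes - z - n) * (sizes[n_nodes - n - 1] - sizes[n_nodes - n])


sizes_mb = [18, 15, 10, 8]   # |s1|..|s4|, ordered largest first
N, Z = 4, 2

print(initial_capacity(sizes_mb, N, Z))   # 16 (MB)
print(extra_info(sizes_mb, N, Z, n=1))    # 2 (MB), first round
print(extra_info(sizes_mb, N, Z, n=2))    # 0, second round (assumed n = 2)
```

In the second round the factor N−Z−n is zero, so no further data can be gained by enlarging packets on the existing nodes, which is why new storage nodes are added instead.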
File Partitioning
The first Z storage nodes are allocated to store the keys only, and the remaining N−Z storage nodes store the file partitions masked with keys. The file F is divided into equal partitions each with size λq, where λq is the GCF (Greatest Common Factor) of {|Ps
When applied to the example storage node arrangement shown in
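A minimal sketch of the partitioning step: the block size is taken as the greatest common factor of the data-node packet sizes, and each node then holds its packet size divided by the block size in blocks. The packet sizes of 10 MB and 8 MB follow the example. The third size of 6 MB is an assumption, chosen so that the block counts match the set sizes 5, 4, and 3 used later in the example. The function name is hypothetical.

```python
from functools import reduce
from math import gcd


def block_layout(data_packet_sizes):
    """Return (block size, blocks per node) for the nodes s_{Z+1}..s_N."""
    block_size = reduce(gcd, data_packet_sizes)
    return block_size, [size // block_size for size in data_packet_sizes]


block_size, blocks = block_layout([10, 8, 6])   # MB; the 6 MB entry is assumed
print(block_size, blocks, sum(blocks))          # 2 [5, 4, 3] 12
```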
Key Generation and Constructing gi's
As mentioned before, the first Z storage nodes store keys only. Each remaining storage node si, Z+1≤i≤N, stores ni blocks each with size λq, where each block includes two parts. The first part is a file partition and the second part is a function of keys. Next, the keys stored in the first Z storage nodes are constructed as well as the second parts of packets stored in the remaining N−Z storage nodes. The minimum required number of key blocks to keep data confidential from any Z storage nodes is determined by the number of blocks stored in the storage node with the largest memory size. More specifically, ZnZ+1 is the minimum required number of key blocks, where the size of each block is λq. Therefore, ZnZ+1 random numbers in q are generated, where q=2λq and each generated number is put into a block as shown in
Each first Z storage node si, 1≤i≤Z, stores ni=nZ+1 key blocks. More specifically, as will be explained in section “Packet Generation”, the key packet containing the set of key blocks {kj|(i−1)n1+1≤j≤in1} will be stored in storage node si, 1≤i≤Z as shown in
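The index rule {kj|(i−1)n1+1≤j≤in1} can be made concrete with a few lines of code. This sketch only enumerates the key-block indices assigned to each key node. The function name is hypothetical, and n1 is the number of blocks per key node (equal to nZ+1, as stated above).

```python
def key_blocks_for_node(i, n1):
    """1-based indices j of the key blocks k_j stored in key node s_i."""
    return list(range((i - 1) * n1 + 1, i * n1 + 1))


# With Z = 2 key nodes and n1 = 5 blocks per node there are Z * n1 = 10 key blocks.
for i in (1, 2):
    print(f"s{i}:", key_blocks_for_node(i, n1=5))
```

For these values, node s1 holds k1 through k5 and node s2 holds k6 through k10.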
The set of gi's that are used in storage si+Z is {gj|(i−1)n1+1≤j≤in1}, therefore the first condition for constructing gi's is that {gj|(i−1)n1+1≤j≤in1}, ∀Z+1≤j≤N should be a linearly independent set. To satisfy this condition, each gj, (i−1)n1+1≤j≤in1 is constructed using a different set of keys. In addition, any Z storage nodes should contain linearly independent packets. Therefore, the second condition for constructing gi's is that any set of gi's selected from any Z storage nodes should create a linearly independent set. In the following, the details of a method used to satisfy these two conditions in a cost-efficient way are described.
The set of gi's for each storage is constructed one by one sequentially from storage sZ+1 until the last storage sN. To construct the set of {gj|1≤j≤n1} that will be used in blocks of the packet for storage sZ+1, linear combinations of the keys stored in the first Z storage nodes are formed. More specifically, g1 used in the first block of packet Ps
Similarly, the set of {gj|n1+1≤j≤n1+nZ+2} are constructed that will be used in blocks of the packet for storage sZ+2. However, the coefficients should be selected such that the constructed gj's are linearly independent from the gi's constructed for packet Ps
Use of the second row of this matrix for creating gi's required for generating packet Ps
The Vandermonde matrix parameters R and q* need to be selected. The smaller the value of q*, the lower the computational complexity of creating the linear combinations. However, if the value of q* is too small, the Vandermonde matrix cannot provide enough linearly independent rows. Therefore, q* is selected to minimize the complexity subject to the constraint that R linearly independent rows can be created with the defined Vandermonde matrix V. Note that only as many rows of the Vandermonde matrix as are needed are created.
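The tradeoff in choosing q* can be illustrated with a generic Vandermonde construction. The sketch below is not the matrix defined in this disclosure. It assumes a prime modulus and rows of the form [1, a, ..., a^(Z−1)] for distinct evaluation points a, and it picks the smallest prime that offers at least R distinct points. All function names are hypothetical.

```python
from itertools import combinations


def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def smallest_prime_at_least(n):
    while not is_prime(n):
        n += 1
    return n


def vandermonde(rows, cols, p):
    """Row a is [a^0, a^1, ..., a^(cols-1)] mod p for a = 0..rows-1."""
    return [[pow(a, j, p) for j in range(cols)] for a in range(rows)]


def rank_mod_p(matrix, p):
    """Rank over GF(p) by Gauss-Jordan elimination."""
    m = [row[:] for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] % p), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)          # modular inverse (Python 3.8+)
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col] % p:
                factor = m[r][col]
                m[r] = [(x - factor * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


R, Z = 6, 2                                      # illustrative sizes
p = smallest_prime_at_least(R)                   # 7
V = vandermonde(R, Z, p)
print(p, all(rank_mod_p(list(rows), p) == Z for rows in combinations(V, Z)))
```

With a modulus smaller than R, two evaluation points coincide modulo p and the corresponding rows repeat, which is the failure mode described above for a q* that is too small.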
Continuing the previous example, the first Z=2 storage nodes store keys only and each of the remaining storage nodes s3, s4, s5 stores blocks each with size λq, where each block is formed of two parts. The first part is a file partition and the second part is a function of keys. The minimum required number of key blocks to keep data confidential from any Z=2 storage nodes is determined by the number of blocks stored in the storage node with the largest memory size. More specifically, ZnZ+1=10 is the minimum required number of key blocks, where the size of each block is λq. Therefore, 10 random numbers in q are generated, where q=2λq and each generated number is put into a block, as shown in
In order to construct the second parts of packets stored in the remaining 3 storage nodes, linearly independent gi's are created. More specifically, for each storage si, 3≤i≤5, all gi's used in all blocks should be linearly independent. The set of gi's that are used in storage nodes s3, s4, and s5 are G1={g1, g2, g3, g4, g5}, G2={g6, g7, g8, g9}, and G3={g10, g11, g12}, respectively. To keep data confidential from each storage node, each of these sets should be a linearly independent set. To satisfy this condition, for each set, different gi's are constructed using different keys. In addition, any Z=2 storage nodes should contain linearly independent packets. Therefore, the second condition for constructing gi's is that a set created as a union of any selected 2 storage nodes is a linearly independent set. In other words, each of the sets G1∪G2={g1, g2, g3, g4, g5, g6, g7, g8, g9}, G1∪G3={g1, g2, g3, g4, g5, g10, g11, g12}, and G2∪G3={g6, g7, g8, g9, g10, g11, g12}, should be a linearly independent set. In the following, the details are given of a method to satisfy these two conditions in a cost-efficient way. The G1's are constructed one by one sequentially from i=1 till i=3. To construct G1, use linear combinations of the keys stored in the first 2 storage nodes. More specifically, g1 used in the first block of packet Ps
Similarly, G2 is constructed for use in blocks of packets for storage s4. However, the coefficients should be selected such that G2 is linearly independent from G1. For this purpose, Vandermonde matrix in 7, is defined as shown in
Packet Generation
The first Z storage nodes are assigned to store keys. These are the extra bits of information that need to be stored to keep data F confidential from any Z storage nodes. More specifically, packet Ps
Returning to the previous example, the first 2 storage nodes are assigned to store keys. This is the extra information that needs to be stored to keep data F confidential from any Z=2 storage nodes. More specifically, packets Ps
However, there is no guarantee that by having access to the data stored in any Z>2 storage nodes, no partial information about file F is revealed. For instance, by having access to the data stored in the storage nodes s3, s4, and s5, the partial information of f1+f11−2f6 can be revealed, as the collection of (i) the 1st block of Ps
In
The instructions 2608 are operable to cause the processor 2602 to select a value Z (e.g., selected by a system designer via the apparatus) such that an attacker having access to Z of the storage nodes 2614 is unable to decode a file 2616 of size |F| stored in the network 2612 that distributedly stores the file in more than the Z storage nodes. The instructions 2608 cause the processor 2602 to select a number N* of the storage nodes that minimizes a cost function that includes |F|, Z, an initial data access cost CT, and a transmission and downloading cost Cd. The instructions 2608 cause the processor 2602 to initially allocate equal memory sizes from the N* storage nodes with the largest available capacity to store the file and a set 2618 of linear code keys.
The instructions 2608 cause the processor 2602 to iteratively determine a first cost of adding more storage nodes to the N* storage nodes and a second cost of allocating more memory from each of the N* storage nodes. Based on a minimal cost determined from the iterative determinations, the instructions 2608 cause the processor 2602 to store the file 2616 and linear code keys 2618 in N≥N* of the storage nodes, the keys 2618 stored in the first through Zth storage nodes and independent linear combinations of the keys 2618 and partitions of the file 2616 stored in the Z+1th to Nth storage nodes.
The various embodiments described above may be implemented using circuitry, firmware, and/or software modules that interact to provide particular results. One of skill in the arts can readily implement such described functionality, either at a modular level or as a whole, using knowledge generally known in the art. For example, the flowcharts and control diagrams illustrated herein may be used to create computer-readable instructions/code for execution by a processor. Such instructions may be stored on a non-transitory computer-readable medium and transferred to the processor for execution as is known in the art. The structures and procedures shown above are only a representative example of embodiments that can be used to provide the functions described hereinabove.
The foregoing description of the example embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the embodiments to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. Any or all features of the disclosed embodiments can be applied individually or in any combination and are not meant to be limiting, but purely illustrative. It is intended that the scope of the invention be limited not by this detailed description, but rather determined by the claims appended hereto.
Number | Date | Country | |
---|---|---|---|
20210133151 A1 | May 2021 | US |