Data-state-describing data structures

Abstract
Various method and system embodiments of the present invention are directed to data and data-state-describing data structures employed in storage components of a distributed data-storage system. In one embodiment of the present invention, a hierarchical data structure stores the data state of a component data-storage system of a distributed data-storage system. In another embodiment of the present invention, a data-block address, stored in a computer-readable memory within a component data-storage system of a distributed data-storage system, includes a block identifier and additional data fields that serve to uniquely specify the addressed data block when multiple copies of the data block are stored in the component data-storage system under different redundancy schemes.
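The role of the additional address fields can be illustrated with a minimal sketch. It assumes the fields named in the claims (a segment identifier, a component-data-storage-system identifier, and a role comprising a redundancy scheme and, for erasure coding, a position and a stripe size). The names `Scheme`, `Role`, `BlockAddress`, and `store` are hypothetical and chosen only for illustration; the sketch is not the encoding used by any particular embodiment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scheme(Enum):
    """Redundancy scheme, encoded as an enumeration value (hypothetical names)."""
    MIRROR = 0
    ERASURE = 1


@dataclass(frozen=True)
class Role:
    """Role played by a component data-storage system for a segment."""
    scheme: Scheme
    position: Optional[int] = None      # position in the ordered list (erasure coding only)
    stripe_size: Optional[int] = None   # stripe size (erasure coding only)


@dataclass(frozen=True)
class BlockAddress:
    """A block identifier plus the additional fields that disambiguate copies."""
    segment_id: int
    brick_id: int
    role: Role
    block_id: int


# Two copies of the same block held by the same component system under
# different redundancy schemes; only the role distinguishes their addresses.
mirror_copy = BlockAddress(7, 2, Role(Scheme.MIRROR), 42)
coded_copy = BlockAddress(7, 2, Role(Scheme.ERASURE, position=1, stripe_size=4), 42)

store = {
    mirror_copy: b"copy under mirroring",
    coded_copy: b"copy under erasure coding",
}
```

Because the role is part of the address, the two copies occupy distinct entries in `store` even though their block, segment, and component-system identifiers are identical.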
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows a high-level diagram of a FAB mass-storage system according to one embodiment of the present invention.



FIG. 2 shows a high-level diagram of an exemplary FAB brick according to one embodiment of the present invention.



FIGS. 3-4 illustrate the concept of data mirroring.



FIG. 5 shows a high-level diagram depicting erasure coding redundancy.



FIG. 6 shows a 3+1 erasure coding redundancy scheme using the same illustration conventions as used in FIGS. 3 and 4.



FIG. 7 illustrates the hierarchical data units employed in a current FAB implementation that represent one embodiment of the present invention.



FIGS. 8A-D illustrate a hypothetical mapping of logical data units to physical disks of a FAB system that represents one embodiment of the present invention.



FIG. 9 illustrates, using a different illustration convention, the logical data units employed within a FAB system that represent one embodiment of the present invention.



FIG. 10A illustrates the data structure maintained by each brick that describes the overall data state of the FAB system and that represents one embodiment of the present invention.



FIG. 10B illustrates a brick segment address that incorporates a brick role according to one embodiment of the present invention.



FIGS. 11A-H illustrate various different types of configuration changes reflected in the data-description data structure shown in FIG. 10A within a FAB system that represent one embodiment of the present invention.



FIGS. 12-18 illustrate the basic operation of a distributed storage register.



FIG. 19 shows the components used by a process or processing entity P_i that implements, along with a number of other processes and/or processing entities P_j (j≠i), a distributed storage register.



FIG. 20 illustrates determination of the current value of a distributed storage register by means of a quorum.



FIG. 21 shows pseudocode implementations for the routine handlers and operational routines shown diagrammatically in FIG. 19.



FIG. 22 shows modified pseudocode, similar to the pseudocode provided in FIG. 21, which includes extensions to the storage-register model that handle distribution of segments across bricks according to erasure coding redundancy schemes within a FAB system that represents one embodiment of the present invention.



FIG. 23 illustrates the heavy dependence on timestamps of the data-consistency techniques based on the storage-register model within a FAB system that represents one embodiment of the present invention.



FIG. 24 illustrates hierarchical time-stamp management that represents one embodiment of the present invention.



FIGS. 25-26 provide pseudocode for a further extended storage-register model that includes the concept of quorum-based writes to multiple, active configurations that may be present due to reconfiguration of a distributed segment within a FAB system that represent one embodiment of the present invention.



FIG. 27 shows high-level pseudocode for extension of the storage-register model to the migration level within a FAB system that represents one embodiment of the present invention.



FIG. 28 illustrates the overall hierarchical structure of both control processing and data storage within a FAB system that represents one embodiment of the present invention.


Claims
  • 1. A data structure stored in a computer readable medium within a component data-storage system of a distributed data storage system, the data structure including a hierarchy of data-structure elements including: a virtual-disk table containing entries each referencing a virtual-disk-image table and each representing a virtual disk that may be stored and replicated on one or more component data-storage systems; and a number of virtual-disk-image tables, each table referenced by a virtual-disk-table entry and representing one replicate of a virtual disk.
  • 2. The data structure of claim 1 wherein the data structure further includes a number of segment configuration nodes, each segment configuration node representing a virtual disk segment distributed according to one or two redundancy schemes over a number of component data-storage systems; and wherein each virtual-disk-image table contains entries that each references a segment configuration node.
  • 3. The data structure of claim 2 wherein each segment configuration node is referenced by one or more virtual-disk-image-table entries; wherein the data structure further includes a number of cgrp data-structure elements, each cgrp data-structure element representing a virtual disk segment distributed in one or more distribution configurations according to a redundancy scheme over a number of component data-storage systems; and wherein each segment configuration node references one or two cgrp data-structure elements.
  • 4. The data structure of claim 3 wherein each cgrp data-structure element is referenced by one or more segment configuration nodes; wherein the data structure further includes a number of cfg data-structure elements, each cfg data-structure element representing a virtual disk segment distributed in a distribution configuration according to a redundancy scheme over a number of component data-storage systems; and wherein each cgrp data-structure element references one or more cfg data-structure elements.
  • 5. The data structure of claim 4 wherein each cfg data-structure element is referenced by a cgrp data-structure element; wherein each cfg data-structure includes indications of component-data-storage-system health for each of the number of component data-storage systems over which the virtual disk segment is distributed; and wherein each cfg data-structure element references, or includes, a layout data-structure element.
  • 6. The data structure of claim 5 wherein a layout data-structure element includes an indication of the redundancy scheme according to which the virtual disk segment is distributed and a list of the component data-storage systems over which the virtual disk segment is distributed.
  • 7. The data structure of claim 6 wherein the list of the component data-storage systems over which the virtual disk segment is distributed is ordered, the position of each component data-storage system in the list indicative of a role played by the component data-storage system in an erasure coding redundancy scheme.
  • 8. The data structure of claim 7 wherein one or more of the number of virtual-disk-image tables, the number of segment configuration nodes, the number of cgrp data-structure elements, and the number of cfg data-structure elements include reference counts to facilitate garbage collection and coordination of data-structure-driven operations.
  • 9. The data structure of claim 7 wherein one or more of the number of virtual-disk-image tables, the number of segment configuration nodes, the number of cgrp data-structure elements, and the number of cfg data-structure elements include references to additional data-structure elements to facilitate bottom-up and lateral traversals of data-structure elements within the data structure.
  • 10. The data structure of claim 1 wherein the computer readable medium is a volatile memory within the component data-storage system.
  • 11. The data structure of claim 1 wherein the computer readable medium is a non-volatile memory within the component data-storage system.
  • 12. The data structure of claim 1 wherein the computer readable medium is a mass-storage device within the component data-storage system.
  • 13. The data structure of claim 1 wherein the data structure or portions of the data structure are stored in two or more different computer readable media.
  • 14. A method for representing a data state of a distributed data-storage system comprising a number of networked component data-storage systems, the method comprising: storing a hierarchical data structure in a computer-readable medium representing the data state on each of the data-storage systems; and propagating updates to a hierarchical data structure with a component data-storage system to the hierarchical data structures of all other component data-storage systems of the distributed data-storage system so that all component data-storage systems maintain a current data state for the entire distributed data-storage system.
  • 15. The method of claim 14 wherein each hierarchical data structure comprises: a virtual-disk table containing entries, each entry referencing a virtual disk image; a number of virtual-disk-image tables, each table referenced by a virtual-disk-table entry, and each virtual-disk-image table containing entries that reference a segment configuration node; a number of segment configuration nodes, each segment configuration node referenced by one or more virtual-disk-image-table entries, and each segment configuration node referencing one or two cgrp data-structure elements; a number of cgrp data-structure elements, each cgrp data-structure element referenced by one or more segment configuration nodes, and each cgrp data-structure element referencing one or more cfg data-structure elements; and a number of cfg data-structure elements, each cfg data-structure element referenced by a cgrp data-structure element, and each cfg data-structure element referencing, or including, a layout data-structure element.
  • 16. The method of claim 14 wherein a virtual-disk-table entry represents a virtual disk that may be stored and replicated on one or more component data-storage systems; wherein a virtual-disk-image-table entry represents one replicate of a virtual disk, generally stored on a subset of geographically co-located component data-storage systems; wherein a segment configuration node represents a virtual disk segment distributed according to one or two redundancy schemes over a number of component data-storage systems; wherein a cgrp data-structure element represents a virtual disk segment distributed in one or more distribution configurations according to a redundancy scheme over a number of component data-storage systems; and wherein a cfg data-structure element represents a virtual disk segment distributed in a distribution configuration according to a redundancy scheme over a number of component data-storage systems.
  • 17. Computer instructions stored within a computer-readable medium that implement the method of claim 14.
  • 18. A data structure stored in a computer readable medium within a component data-storage system of a distributed data storage system, the data structure including a hierarchy of data-structure elements including: a number of configuration-group data-structure elements, each configuration-group data-structure element representing a segment of data blocks distributed according to a redundancy scheme over one or more component data-storage systems; and a number of segment configuration nodes, each segment configuration node representing a segment of data blocks distributed according to one redundancy scheme over one or more component data-storage systems, and correspondingly referencing one configuration-group data-structure element, or representing a segment of data blocks distributed over one or more component data-storage systems according to first and second redundancy schemes, and correspondingly referencing two configuration-group data-structure elements.
  • 19. The data structure of claim 18 wherein a segment configuration node referencing two configuration-group data-structure elements represents a segment of data blocks undergoing migration from the first redundancy scheme to the second redundancy scheme.
  • 20. The data structure of claim 18 further including a number of configuration data-structure elements, each configuration data-structure element representing a segment of data blocks distributed according to one redundancy scheme over one or more component data-storage systems, each configuration data-structure element referenced by a configuration-group data-structure element.
  • 21. The data structure of claim 20 wherein a configuration-group data-structure element that references first and second configuration data-structure elements represents a segment undergoing reconfiguration from a first configuration to a second configuration.
  • 22. The data structure of claim 20 wherein each configuration data-structure element includes indications of the component data-storage systems over which the segment of data blocks is distributed and an indication of health for each of the component data-storage systems over which the segment of data blocks is distributed.
  • 23. The data structure of claim 18 wherein the computer readable medium is a volatile memory within the component data-storage system.
  • 24. The data structure of claim 18 wherein the computer readable medium is a non-volatile memory within the component data-storage system.
  • 25. The data structure of claim 18 wherein the computer readable medium is a mass-storage device within the component data-storage system.
  • 26. The data structure of claim 18 wherein the data structure or portions of the data structure are stored in two or more different computer readable media.
  • 27. A data-block address stored in a computer-readable memory within a component data-storage system of a distributed data-storage system, the data block address comprising: a block identifier; and additional data fields that serve to uniquely specify the data block when multiple copies of the data block are stored in the component data-storage system under different data-consistency schemes.
  • 28. A data-block address of claim 27 wherein the additional data fields comprise: a segment identifier; a component-data-storage-system identifier; and an indication of a component-data-storage-system role.
  • 29. The data-block address of claim 28 wherein the indication of a component-data-storage-system role further comprises: an indication of a redundancy scheme; and when the redundancy scheme indicated by the redundancy-scheme indication is an erasure coding redundancy scheme, an indication of a component-data-storage-system position within an ordered list of component-data-storage systems over which the segment is distributed according to the redundancy scheme, and an indication of a stripe size.
  • 30. The data-block address of claim 29 wherein the indication of a redundancy scheme, the indication of a component-data-storage-system position within an ordered list of component-data-storage systems, and the indication of a stripe size are encoded as enumeration values or in other encodings, rather than as integers, in order to conserve storage space.
  • 31. A method for addressing data blocks in a distributed data-storage system composed of networked component data-storage systems over which data segments are distributed according to a redundancy scheme, the method comprising: allocating a data-block address in a computer-readable medium; and placing into the allocated data-block address a block identifier and other values that serve to uniquely specify the data block when multiple copies of the data block are stored in the component data-storage system under different data-consistency schemes.
  • 32. The method of claim 31 wherein placing into the allocated data-block address other values that serve to uniquely specify the data block when multiple copies of the data block are stored in the component data-storage system under different data-consistency schemes further includes placing into the allocated data-block address a segment identifier, a component-data-storage-system identifier, and an indication of a component-data-storage-system role.
  • 33. The method of claim 32 further including encoding the indication of a redundancy scheme, the indication of a component-data-storage-system position within an ordered list of component-data-storage systems, and the indication of a stripe size as enumeration values or in other encodings, rather than as integers, in order to conserve storage space.
  • 34. The method of claim 31 wherein placing into the allocated data-block address an indication of a component-data-storage-system role further comprises: placing into the allocated data-block address an indication of a redundancy scheme and, when the redundancy scheme indicated by the redundancy-scheme indication is an erasure coding redundancy scheme, an indication of a component-data-storage-system position within an ordered list of component-data-storage systems over which the segment is distributed according to the redundancy scheme, and an indication of a stripe size.
  • 35. The method of claim 34 further including passing the data-block address, or a reference to the data block, to a routine that retrieves the block identifier, the segment identifier, and the indication of the component-data-storage-system role, and uses the retrieved block identifier, the segment identifier, and the indication of the component-data-storage-system role to access table entries in order to determine a physical address of the data block within a data-storage device of a component data-storage system.
  • 36. The method of claim 31 further including passing the data-block address, or a reference to the data block, to a routine that stores the data-block address in a computer-readable memory to describe a data block.
  • 37. Computer instructions encoded in a computer-readable memory that implement the method of claim 31.
  • 38. A method for representing the data state of a distributed data storage system, the method comprising: a means for representing a number of virtual disks that may be stored and replicated on one or more component data-storage systems of the distributed data-storage system; a means for representing each of a number of virtual disk segments distributed according to one or two redundancy schemes over a number of component data-storage systems and each included in a virtual disk; and a means for representing each of a number of virtual disk segments distributed in one or more distribution configurations according to a redundancy scheme over a number of component data-storage systems and each included in a virtual disk.
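
The hierarchy of data-structure elements recited in the claims above can be pictured with a minimal sketch. It assumes in-memory Python objects; the class names, field names, scheme strings, and brick numbers are hypothetical and chosen only for illustration. Reference counts and the back-references used for bottom-up and lateral traversal are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Layout:
    scheme: str          # redundancy scheme, e.g. "mirror" (hypothetical label)
    bricks: List[int]    # ordered list; position indicates role under erasure coding


@dataclass
class Cfg:
    layout: Layout
    healthy: Dict[int, bool]   # health indication per component system


@dataclass
class Cgrp:
    cfgs: List[Cfg]            # one or more distribution configurations

    def reconfiguring(self) -> bool:
        # More than one cfg: the segment is moving between configurations.
        return len(self.cfgs) > 1


@dataclass
class SegmentConfigNode:
    cgrps: List[Cgrp]          # one or two cgrp elements

    def __post_init__(self) -> None:
        if len(self.cgrps) not in (1, 2):
            raise ValueError("a segment configuration node references one or two cgrps")

    def migrating(self) -> bool:
        # Two cgrps: the segment is migrating between redundancy schemes.
        return len(self.cgrps) == 2


@dataclass
class VirtualDiskImageTable:
    segments: List[SegmentConfigNode] = field(default_factory=list)


@dataclass
class VirtualDiskTable:
    disks: Dict[str, VirtualDiskImageTable] = field(default_factory=dict)


def make_cfg(scheme: str, bricks: List[int]) -> Cfg:
    """Build a cfg whose component systems are all marked healthy."""
    return Cfg(Layout(scheme, bricks), {b: True for b in bricks})


# A segment migrating from mirroring to a 3+1 erasure coding scheme.
old = Cgrp([make_cfg("mirror", [1, 2, 3])])
new = Cgrp([make_cfg("3+1 erasure coding", [1, 2, 3, 4])])
segment = SegmentConfigNode([old, new])
state = VirtualDiskTable({"vdisk0": VirtualDiskImageTable([segment])})
```

In this sketch, `segment.migrating()` is true because the node references two cgrp elements, while `old.reconfiguring()` is false because each cgrp holds a single cfg.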