1. Field of the Invention
This invention relates generally to the field of data storage and in particular to distributed data storage.
2. Description of the Related Art
The explosive growth of the Internet has ushered in a new area in which information is exchanged and accessed on a constant basis. In response to this growth, there has been an increase in the size of data that is being stored. Users are demanding more than standard HTML documents, wanting access to a variety of data, such as, audio data, video data, image data, and programming data. Thus, there is a need for data storage that can accommodate large sets of data, while at the same time provide fast and reliable access to the data.
One response has been to utilize single storage devices which may store large quantities of data but have difficulties providing high throughput rates. As data capacity increases, the amount of time it takes to access the data increases as well. Processing speed and power has improved, but disk I/O (Input/Output) operation performance has not improved at the same rate making I/O operations inefficient, especially for large data files. One solution has been to break up large data files and store them in distributed systems. However, such systems store a fixed amount of data and are often costly to replace.
The embodiments disclosed herein generally relate to distributed data storage.
In one embodiment, a storage system is provided. The storage system includes a plurality of n storage containers, x1, x2, to xn, configured to store logical data and data protection data, wherein: n is greater than 1; the size of x1≦the size of x2≦ . . . the size of xn-1≦the size of xn and the size of x1<the size of xn; the plurality of n storage containers utilize more than ((n−m)*size of x1) for storing logical data, where m is the number of failed storage containers the system can handle; and the logical data and data protection data may include striped data and mirrored data.
In a further embodiment, a storage system is provided. The storage system includes a plurality of n storage containers, x1, x2, to xn, configured to store logical data and data protection data, wherein: n is greater than 1; the size of x1≦the size of x2≦ . . . the size of xn-1≦the size of xn and the size of x1<the size of xn; the plurality of n storage containers utilize more than ((n−m)*size of x1) for storing logical data, where m is the number of failed storage containers the system can handle; and the storage containers are locally accessed disk drives.
In an additional embodiment, a storage system is provided. The storage system includes a plurality of n storage containers, x1, x2, to xn, configured to store logical data and data protection data, wherein: n is greater than 1; the size of x1≦the size of x2≦ . . . the size of xn-1≦the size of xn and the size of x1<the size of xn; the plurality of n storage containers utilize more than (n*size of x1) for storing physical data; and the logical data and data protection data may include striped data and mirrored data.
In a further embodiment, a method of storing data on heterogeneous storage containers is provided. The method includes receiving a total number of storage containers; receiving a minimum number of protection blocks; determining a first protection scheme; storing a first plurality of stripes of data across all of the storage containers at the first protection until the smallest container of all of the storage containers is full; determining a second protection scheme; and storing a second plurality of stripes of data across the non-full storage containers at the second protection until the smallest container of the non-full storage containers is full.
For purposes of this summary, certain aspects, advantages, and novel features of the invention are described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment of the invention. Thus, for example, those skilled in the art will recognize that the invention may be embodied or carried out in a manner that achieves one advantage or group of advantages as taught herein without necessarily achieving other advantages as may be taught or suggested herein.
These and other features will now be described with reference to the drawings summarized above. The drawings and the associated descriptions are provided to illustrate the embodiments of the invention and not to limit the scope of the invention. Throughout the drawings, reference numbers may be re-used to indicate correspondence between referenced elements. In addition, the first digit of each reference number generally indicates the figure in which the element first appears.
Systems, methods, processes, and data structures which represent one embodiment of an example application of the invention will now be described with reference to the drawings. Variations to the systems, methods, processes, and data structures which represent other embodiments will also be described.
In a traditional RAID system, a single controller is attached to a set of drives and the controller stores data on the drives. These drives are of the same size and they always store the same amount of data. Such drives are often referred to as homogeneous drives since they are the same size throughout the system. While homogeneous drives may be easier to implement since they are of the same size, they do not allow for much flexibility such as, for example, when more space is needed and/or part of a drive becomes unavailable.
Embodiments of the present invention provide systems and methods for using heterogeneous containers where the available space in the containers is of two or more different sizes. In some embodiments, the heterogeneous containers may store some data under one protection scheme and other data under one or more other data protection schemes. This allows for use of more of the container space.
In some embodiments, the heterogeneous containers may be of different sizes and/or may have a different amount of available space. For example, one system of heterogeneous containers includes six containers each of size X, wherein the first three containers have only 75% of their space available whereas the last three containers have 100% of their space available. In another example, one system of heterogeneous containers includes 20 containers, the first 3 of size 250 G, the next 8 of size 500 G, the next 7 of size 110 G, and the last 2 of size 2064 G with all of the containers having 100% of their space available. In a further example, one system of heterogeneous containers includes three distributed nodes, the first node of size 3.6 TB with 70% of its space available, the second node of size 3.6 TB with 100% of its space available, and a third node of size 4.8 TB with 80% of its space available.
In some embodiments, the heterogeneous containers store distributed data that can be protected using one or more types of data protection. For example, a first set of data may be protected at 5+3, a second set of data may be protected at 4+2, a third set of data may be protected at 3+1, and a fourth set of data may be mirrored at level 2×.
Moreover, in some embodiments, the system is dynamic such that containers can be added and/or grown without having to fully reconfigure the system.
A. Storage Apparatus
In one embodiment, the storage apparatus 110 include two or more storage containers 115. The storage apparatus 110 of
In some embodiments, part of a container may be unavailable. There are many reasons why a container may not be available such as, for example, a part of a container may be corrupted, reserved for other use by the system, disconnected from the system, a drive may be lost, and so forth.
It is recognized that the storage containers may store a variety of data including file data, metadata, and data protection data. In the type of file data may include static data, data streams, executable file data, and so forth.
It is recognized that there may be other storage containers that are not part of the set. For example, while there may be a set of six heterogeneous containers, there maybe be other containers that communicated with the system or are part of the system.
B. Storage Module
In one embodiment, the storage module 140 stores data in one or more storage containers 115 of the storage apparatus 110. In addition, in some embodiments, the storage module 140 stores the data using one or more data protection policies and/or levels. In one embodiment, the storage module 140 communicates directly with the storage apparatus 110, whereas in other embodiments, some or all of the communication between the storage module 140 and the storage apparatus 110 is via a communications medium. In one embodiment, the storage module stores data by using all containers in the set for each stripe until the smallest container(s) is filled, using the remaining containers for the subsequent stripes until the next smallest container(s) is filled and so forth until there are not enough containers to maintain a minimum level of protection. This and other embodiments of storing data are discussed further below.
In some embodiments, the storage module stores data based on the data that is available when the data is being stored. This flexibility allows the system to add, remove, and/or change containers to the system without having to stop and fully reconfigure the system. In addition, if the capacity of a container changes, such as, for example, if a sector of a container becomes unreadable, the system can then continue to store date on the remaining area of the container as well as on the other containers even though the container is now of a new, different size.
The word module refers to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, C or C++. A software module may be compiled and linked into an executable program, installed in a dynamically linked library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software modules may be callable from other modules or from themselves, and/or may be invoked in response to detected events or interrupts. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware modules may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors. The modules described herein are preferably implemented as software modules, but may be represented in hardware or firmware. Moreover, although in some embodiments a module may be separately compiled, in other embodiments a module may represent a subset of instructions of a separately compiled program, and may not have an interface available to other logical program units.
The storage module 140 may run on a variety of computer systems such as, for example, a computer, a server, a smart storage unit, and so forth. In one embodiment, the computer may be a general purpose computer using one or more microprocessors, such as, for example, an Intel® Pentium® processor, an Intel® Pentium® II processor, an Intel® Pentium® Pro processor, an Intel® Pentium® IV processor, an Intel® Pentium® D processor, an Intel® Core™ processor, an xx86 processor, an 8051 processor, a MIPS processor, a Power PC processor, a SPARC processor, an Alpha processor, and so forth. The computer may run a variety of operating systems that perform standard operating system functions such as, for example, opening, reading, writing, and closing a file. It is recognized that other operating systems may be used, such as, for example, Microsoft® Windows® 3.X, Microsoft® Windows 98, Microsoft® Windows® 2000, Microsoft® Windows® NT, Microsoft® Windows® CE, Microsoft® Windows® ME, Microsoft® Windows® XP, Palm Pilot OS, Apple® MacOS®, Disk Operating System (DOS), UNIX, IRIX, Solaris, SunOS, FreeBSD, Linux®, or IBM® OS/2® operating systems.
C. Communications Medium
The communication medium 130 may be one or more networks, including, for example, the Internet, a local area network (LAN), a wide area network (WAN), a wireless network, a wired network, an intranet, a bus, and so forth.
D. Data Protection
It is recognized that the heterogeneous storage system may utilize one or more data protection policies and/or levels. For example, the heterogeneous storage system may implement one or more error correcting codes. These codes include a code “in which each data signal conforms to specific rules of construction so that departures from this construction in the received signal can generally be automatically detected and corrected. It is used in computer data storage, for example in dynamic RAM, and in data transmission.” (http://en.wikipedia.org/wiki/Error_correcting_code). Examples of error correction code include, but are not limited to, Hamming code, Reed-Solomon code, Reed-Muller code, Binary Golay code, convolutional code, and turbo code. In some embodiments, the simplest error correcting codes can correct single-bit errors and detect double-bit errors, and other codes can detect or correct multi-bit errors.
In addition, the error correction code may include forward error correction, erasure code, fountain code, parity protection, and so forth. “Forward error correction (FEC) is a system of error control for data transmission, whereby the sender adds redundant to its messages, which allows the receiver to detect and correct errors (within some bound) without the need to ask the sender for additional data.” (http://en.wikipedia.org/wiki/forward error correction). Fountain codes, also known as rateless erasure codes, are “a class of erasure codes with the property that a potentially limitless sequence of encoding symbols can be generated from a given set of source symbols such that the original source symbols can be recovered from any subset of the encoding symbols of size equal to or only slightly larger than the number of source symbols.” (http://en.wikipedia.org/wiki/Fountain code). “An erasure code transforms a message of n blocks into a message with >n blocks such that the original message can be recovered from a subset of those blocks” such that the “fraction of the blocks required is called the rate, denoted r (http://en.wikipedia.org/wiki/Erasure code). “Optimal erasure codes produce n/r blocks where any n blocks is sufficient to recover the original message.” (http://en.wikipedia.org/wiki/Erasure code). “Unfortunately optimal codes are costly (in terms of memory usage, CPU time or both) when n is large, and so near optimal erasure codes are often used,” and “[t]hese require (1+ε)n blocks to recover the message. Reducing ε can be done at the cost of CPU time.” (http://en.wikipedia.ori/wiki/Erasure code).
The data protection may include other error correction methods, such as, for example, Network Appliance's RAID double parity methods, which includes storing data in horizontal rows, calculating parity for data in the row, and storing the parity in a separate row parity disk, along with other double parity methods, diagonal parity methods, and so forth.
In addition, for each protection policy, there may be one or more protection schemes. For example, a protection policy of “n+m,” there may be several levels of protection, such as, for example, n1+m, n2+m, n3+m, and so forth. As another example, for an n+1 protection policy, data may be protected at the following levels: 3+1, 2+1, and 2×. The system may include more than one data protection policy and/or level, referred to as protection schemes.
A. Example of Multiple Protection Schemes
B. Example of Multiple Protection Schemes Using Parity Protection
C. Distributed File System
In some embodiments, the systems and methods disclosed herein may be used to stored files of a distributed file system. As used herein, a file is a collection of data stored in one unit under a filename. Embodiments of a distributed file system suitable for accommodating embodiments of heterogeneous storage system disclosed herein are disclosed in U.S. patent application Ser. No. 10/007,003, titled, “Systems And Methods For Providing A Distributed File System Utilizing Metadata To Track Information About Data Stored Throughout The System,” filed Nov. 9, 2001 which claims priority to Application No. 60/309,803, entitled “Systems And Methods For Providing A Distributed File System Utilizing Metadata To Track Information About Data Stored Throughout The System,” filed Aug. 3, 2001, U.S. Pat. No. 7,156,524 entitled “Systems And Methods For Providing A Distributed File System Incorporating A Virtual Hot Spare,” filed Oct. 25, 2002, and U.S. patent application Ser. No. 10/714,326 entitled “Systems And Methods For Restriping Files In A Distributed File System,” filed Nov. 14, 2003, which claims priority to Application No. 60/426,464, entitled “Systems And Methods For Restriping Files In A Distributed File System,” filed Nov. 14, 2002, all of which are hereby incorporated herein by reference in their entirety.
For example, if there are 4 containers, C1, C2, C3, C4, of size 3, 3, 4, and 6, the minimum amount of error correction is 1, and the file size is 12 blocks, the blocks will be stored as follows: the first nine blocks of the file and three parity blocks will be stored on containers C1, C2, C3, C4 at protection 3+1; the tenth block of the file and one parity block will be stored on containers C3, C4 at protection 1+1; and the eleventh and twelfth block will not be stored on the containers because while the remaining space can store the last two blocks, it cannot store the last two blocks with the minimum protection.
While
For example, if there are 4 containers, C1, C2, C3, C4, of size 3, 3, 4, and 6, the minimum amount of error correction is 1, and the file size is 12 blocks. In
While
While certain embodiments of the invention have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the present invention. Accordingly, the breadth and scope of the present invention should be defined in accordance with the following claims and their equivalents.
Some of the figures and descriptions relate to an embodiment of the invention wherein the environment is that of a distributed system. The present invention is not limited by the type of environment in which the systems, methods, processes and data structures are used. The systems, methods, structures, and processes may be used in other environments, such as, for example, other distributed systems, the Internet, the World Wide Web, a private network for a hospital, a broadcast network for a government agency, an internal network of a corporate enterprise, an intranet, a local area network, a wide area network, a wired network, a wireless network, and so forth. It is also recognized that in other embodiments, the systems, methods, structures and processes may be implemented as a single module and/or implemented in conjunction with a variety of other modules and the like.
It is also recognized that the term “remote” may include data, objects, devices, components, and/or modules not stored locally, that is not accessible via the local bus or data stored locally and that is “virtually remote.” Thus, remote data may include a device which is physically stored in the same room and connected to the user's device via a network. In other situations, a remote device may also be located in a separate geographic area, such as, for example, in a different location, country, and so forth.
The above-mentioned alternatives are examples of other embodiments, and they do not limit the scope of the invention. It is recognized that a variety of data structures with various fields and data sets may be used. In addition, other embodiments of the flow charts may be used.