1. Field of the Invention
The present invention is directed toward the field of data storage, and more particularly toward a distributed network data storage system.
2. Art Background
There is an increasing demand for systems that store large amounts of data. Many companies struggle to provide scalable, cost-effective storage solutions for large amounts of data stored in files (e.g., terabytes of data). One type of prior art system used to store data for computers is known as network attached storage (“NAS”). In a NAS configuration, a computer, such as a server, is coupled to physical storage, such as one or more hard disk drives. The NAS server is accessible over a network. In order to access the storage, the client computer submits requests to the server to store and retrieve data.
Conventional NAS technology has several inherent limitations. First, NAS systems cannot scale performance and capacity beyond a single server. Current NAS systems scale performance only within the limits of a single NAS server with a single network connection, and a single NAS server can scale capacity only to the finite number of disks attached to that NAS server. These fundamental limitations of current file storage systems create a variety of challenges. Customers must use multiple NAS systems to meet capacity and performance requirements, which requires them to manage multiple file systems and multiple NAS system images. This approach leads to inefficient utilization of storage assets because files must be manually distributed across multiple NAS systems to meet overall capacity and performance requirements. Invariably, this leaves pockets of unused capacity in the multiple NAS systems. Moreover, frequently accessed files, sometimes referred to as hot files, may be served by only a single NAS server, resulting in a bottleneck that impacts performance of the storage system. These issues result in substantially higher management costs to the end user, as well as high acquisition costs to purchase proprietary NAS systems.
A storage area network (“SAN”) is another configuration used to store large amounts of data. In general, a SAN configuration consists of a network of disks. Clients access disks over a network. Using the SAN configuration, the client typically accesses each individual disk as a separate entity. For example, a client may store a first set of files on a first disk in the SAN system, and store a second set of files on a second disk in the SAN system. Thus, this technique requires the clients to manage file storage across the disks on the storage area network, which makes the SAN configuration less desirable. Accordingly, it is desirable to develop a system that manages files with a single file system across multiple disks.
A distributed data storage system stores a single image file system across a plurality of physical storage volumes. One or more clients communicate with the distributed data storage system through a network. The distributed data storage system includes a plurality of storage nodes. Each storage node services requests for storage operations on the files stored on the physical storage volumes. In one embodiment, the physical storage is direct attached storage. For this embodiment, at least one physical storage volume is directly coupled to each storage node. In another embodiment, the physical storage volumes are coupled to the storage nodes through a storage area network (“SAN”).
To conduct a storage operation, including read and write operations, a client transmits a request over the network for a file identified in the file system. One of the storage nodes is selected to process the request. In one embodiment, the distributed data storage system contains a load balancing switch that receives the request from the client and that selects one of the storage nodes to process the storage operation. To process the request, the storage node accesses at least one of the physical volumes and transmits a response for the storage operation to the client.
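The node selection step can be illustrated with a minimal Python sketch. The round-robin policy and the names `LoadBalancer` and `select_node` are assumptions made for illustration only; the description does not state how the load balancing switch chooses a storage node.

```python
from itertools import cycle


class LoadBalancer:
    """Hypothetical stand-in for the load balancing switch."""

    def __init__(self, storage_nodes):
        if not storage_nodes:
            raise ValueError("at least one storage node is required")
        # cycle() repeats the node list endlessly, giving round-robin order.
        self._nodes = cycle(storage_nodes)

    def select_node(self):
        # Assumed policy: hand each new request to the next node in turn.
        return next(self._nodes)


balancer = LoadBalancer(["node1", "node2", "node3"])
selections = [balancer.select_node() for _ in range(4)]
```

Any policy that spreads requests across the storage nodes could be substituted for the round-robin choice shown here.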
The disclosure of U.S. Provisional Patent Application No. 60/419,778, filed Oct. 17, 2002, entitled “A Distributed Storage System”, is hereby expressly incorporated herein by reference.
The nodes (1-n) are coupled to a network (150). Also coupled to the network are “m” clients, where “m” is an integer value greater than or equal to one. The network may be any type of network that utilizes any well-known protocol (e.g., TCP/IP, UDP, etc.). Also, as shown in
In general, the distributed NAS system of the present invention creates a single system image that scales in a modular way to hundreds of terabytes and several hundred thousand operations per second. In one embodiment, to minimize costs, the distributed NAS system software runs on industry standard hardware and operates with industry standard operating systems. The distributed NAS system allows flexible configurations based on specific reliability, capacity, and performance requirements. In addition, the distributed NAS system scales without requiring any changes to end user behavior, client software or hardware. For optimal performance, in one embodiment, the distributed NAS system distributes client load evenly so as to eliminate a central control point vulnerable to failure or performance bottlenecks. The distributed NAS system permits storage capacity and performance to scale without disturbing the operation of the system. To achieve these goals, the distributed NAS system utilizes a distributed file system as well as a volume manager. In one embodiment, each node (or server) consists of, in addition to standard hardware and operating system software, a distributed file system manager (165, 175 and 185) and a volume manager (160, 170 and 180) for nodes 1, 2 and n, respectively.
The nodes of the distributed NAS system communicate with one or more hard disk drives.
In another embodiment, the nodes of the distributed NAS system utilize disks coupled through a network (e.g., storage area network “SAN”).
In general, index nodes, referred to as “inodes,” uniquely identify files and directories. Inodes map files and directories of a file system to physical locations. Each inode is identified by a number. For a directory, an inode includes a list of file names and subdirectories, if any, as well as a list of data blocks that constitute the file or subdirectory. The inode also contains the size, position, and other attributes of the file or directory. When a selected node (NAS server) receives a request from the client to service a particular inode, the selected node performs a lookup to obtain the physical location of the corresponding file or directory in the physical media.
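As a rough illustration of the inode concept, the following Python sketch models an inode and resolves a path by walking directory entries. The `Inode` fields, the in-memory `table`, and the `resolve` helper are hypothetical names chosen for this example and are not taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    number: int                                   # unique inode number
    is_dir: bool
    size: int = 0
    blocks: list = field(default_factory=list)    # data block addresses
    entries: dict = field(default_factory=dict)   # name -> inode number (directories)


def resolve(inode_table, root_number, path):
    """Walk a slash-separated path from the root inode; return the target inode."""
    current = inode_table[root_number]
    for name in filter(None, path.split("/")):
        if not current.is_dir:
            raise NotADirectoryError(name)
        if name not in current.entries:
            raise FileNotFoundError(path)
        current = inode_table[current.entries[name]]
    return current


# Assumed example data: a root directory, a subdirectory, and one file.
table = {
    1: Inode(1, True, entries={"export": 2}),
    2: Inode(2, True, entries={"temp": 3}),
    3: Inode(3, False, size=1024, blocks=[40, 41]),
}
target = resolve(table, 1, "/export/temp")
```

Here `target.blocks` holds the block addresses that a node would then use to locate the data on the physical media.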
As an initial procedure, a client of the distributed NAS system mounts the distributed file system.
The selected node (file system manager) obtains the inode for the file system root directory, and generates a client file handle to the root directory (block 530,
The file handle, a client-side term, is a unique identifier the client uses to access a file or directory in the distributed file system. In one embodiment, the distributed file system translates the file handle into an inode. In addition, a file handle may include the time and date information for the file/directory. However, any type of file handle may be used as long as the file handle uniquely identifies the file or directory.
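One possible shape for such a file handle is sketched below in Python. The 16-byte layout (an inode number followed by a timestamp) and the function names are assumptions for illustration; the description leaves the handle format open.

```python
import struct

# Assumed layout: two big-endian unsigned 64-bit integers
# (inode number, timestamp in seconds).
_HANDLE_FORMAT = ">QQ"


def make_file_handle(inode_number, timestamp):
    """Pack an inode number and a timestamp into an opaque handle."""
    return struct.pack(_HANDLE_FORMAT, inode_number, timestamp)


def handle_to_inode(handle):
    """Translate a handle back into the inode number it identifies."""
    inode_number, _timestamp = struct.unpack(_HANDLE_FORMAT, handle)
    return inode_number


handle = make_file_handle(55, 1034812800)
```

The client treats `handle` as an opaque value and returns it with later requests, at which point the file system recovers the inode number with `handle_to_inode`.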
The selected node (the node processing the client requests) generates a mount table (block 540,
In one embodiment, the file system for the distributed NAS is a high-performance distributed file system. The file system fully distributes both namespace and data across a set of nodes and exports a single system image for clients, applications and administrators. As a multi-node system, the file system acts as a highly scalable, high-performance file server with no single point of failure. As a storage medium, the file system utilizes a single shared disk array. It harnesses the power of multiple disk arrays connected either via a storage area network or directly to network servers. The file system is implemented entirely in user space, resulting in a lightweight and portable file system. In one embodiment, the file system provides 64-bit support to allow very large file system sizes.
The volume manager (160, 170 and 180,
The volume manager consists of three layers: logical volumes, volume groups, and physical volumes. Each layer has particular properties that contribute to the capabilities of the system. The distributed volume group is the core component of the system. A volume group is a virtualized collection of physical volumes. In its simplest form, a distributed volume group may be analogized to a special data container with reliability properties. A volume group has an associated level of reliability (e.g., RAID level). For example, a distributed volume group may have similar reliability characteristics to traditional RAID 0, 1 or 5 disk arrays. Distributed volume groups are made up of any number, type or size of physical volumes.
A logical volume is a logical partition of a volume group. The file systems are placed in distributed logical volumes. A logical extent is a logically contiguous piece of storage within a logical volume. A physical volume is any block device, either hardware or software, exposed to the operating system. A physical extent is a contiguous piece of storage within a physical storage device. A sector, typically 512 bytes, defines the smallest unit of physical storage on a storage device.
A physical volume is a resource that appears to the operating system as a block-based storage device (e.g., a RAID device, a disk attached through Fibre Channel, or a software RAID device). A volume, either logical or physical, consists of units of space referred to as “extents.” Extents are the smallest units of contiguous storage exposed to the distributed volume manager.
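The relationship between logical extents, physical extents, and sectors can be sketched in Python as follows. The 4 MiB extent size, the `PhysicalExtent` type, and the `locate` helper are assumptions for illustration; only the 512-byte sector size comes from the description.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512                  # bytes, the typical value given in the text
EXTENT_SIZE = 4 * 1024 * 1024      # bytes; assumed value for this sketch


@dataclass(frozen=True)
class PhysicalExtent:
    volume: str    # name of the physical volume holding the extent
    index: int     # extent index within that physical volume


def locate(logical_extents, byte_offset):
    """Map a byte offset in a logical volume to (physical volume, sector)."""
    extent_no, within = divmod(byte_offset, EXTENT_SIZE)
    pe = logical_extents[extent_no]
    sector = (pe.index * EXTENT_SIZE + within) // SECTOR_SIZE
    return pe.volume, sector


# A logical volume modeled as an ordered list of physical extents.
lv = [PhysicalExtent("pv0", 7), PhysicalExtent("pv1", 2)]
first = locate(lv, 0)
second = locate(lv, EXTENT_SIZE + 1024)
```

In this sketch the logical volume is logically contiguous even though its two extents live on different physical volumes.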
The volume manager allows unprecedented flexibility and scalability in storage management to enhance the reliability of large-scale storage systems. In one embodiment, the distributed volume manager implements standard RAID 0, 1 and 5 configurations on distributed volume groups. When created, each distributed volume group is given reliability settings that include stripe size and RAID-set size. Stripe size, sometimes referred to as chunk or block size, is the smallest granularity of data written to an individual physical volume. Stripe sizes of 8 KB, 16 KB and 24 KB are common. RAID-set size refers to the number of stripes between parity calculations. This is typically equal to the number of physical volumes in a volume group.
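To make the stripe size and parity terminology concrete, the Python sketch below shows a RAID 0 style address calculation and a RAID 5 style XOR parity over one RAID set. The function names and the simple layout (no parity rotation) are assumptions for illustration and are not the volume manager's actual arrangement.

```python
from functools import reduce


def stripe_location(byte_offset, stripe_size, num_volumes):
    """RAID 0 style: return (volume index, byte offset within that volume)."""
    stripe_no, within = divmod(byte_offset, stripe_size)
    volume = stripe_no % num_volumes      # stripes rotate across the volumes
    row = stripe_no // num_volumes        # full rows already written
    return volume, row * stripe_size + within


def xor_parity(chunks):
    """RAID 5 style parity: byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


location = stripe_location(20 * 1024, 8 * 1024, 3)

data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
parity = xor_parity(data)
# A lost chunk is rebuilt by XOR-ing the surviving chunks with the parity.
recovered = xor_parity([data[0], data[2], parity])
```

With an 8 KB stripe size and three physical volumes, byte offset 20 KB falls in the third stripe, so `location` points at the third volume, 4096 bytes in.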
As discussed above, inodes consist of pointers to physical blocks that store the underlying data. In one embodiment, inodes are stored on disk in “ifiles.” For directories, inode files contain a list of inodes for all files and directories contained in that directory. In one embodiment, the distributed NAS system utilizes a map manager. In general, a map manager stores information to provide an association between inodes and distributed NAS nodes (servers) managing the file or directory. The map manager, a data structure, is globally stored (i.e., stored on each node) and is atomically updated. Table 1 is an example map manager used in the distributed NAS system.
For this example, the distributed NAS system contains three nodes (A, B and C). Inodes in the range from 0 to 100 are managed by nodeA, inodes in the range from 101 to 200 are managed by nodeB, and inodes in the range from 201 to 300 are managed by nodeC.
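The range-based association in this example can be sketched in Python as a sorted list searched by bisection. The inode ranges and node names come from the example above; the list representation and the `node_for_inode` function are assumptions for illustration, not the map manager's actual data structure.

```python
from bisect import bisect_left

# (last inode number in range, managing node), sorted by upper bound.
MAP_MANAGER = [(100, "nodeA"), (200, "nodeB"), (300, "nodeC")]
_UPPER_BOUNDS = [upper for upper, _ in MAP_MANAGER]


def node_for_inode(inode_number):
    """Return the node that manages the given inode number."""
    if inode_number < 0:
        raise KeyError(inode_number)
    # Find the first range whose upper bound is >= the inode number.
    i = bisect_left(_UPPER_BOUNDS, inode_number)
    if i == len(MAP_MANAGER):
        raise KeyError(inode_number)
    return MAP_MANAGER[i][1]


owner = node_for_inode(150)
```

Because the description states that the map manager is stored on each node, any selected node could perform this lookup locally before forwarding a request.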
If the client has cached the file handle for “/export”, then the client first requests a file handle for “/export/temp.” In response to the client request, the selected node (server) determines the inode for the directory/file (block 620,
With the inode, the selected node determines, from the map manager, the storage node for the directory/file (block 630,
After obtaining the appropriate lock, the selected node transmits a file handle to the client (block 665,
In response to the read request, the file system manager obtains the necessary blocks, from the volume manager, to read the file (block 675,
In general, the volume manager responds to requests from the distributed file system manager.
The volume manager determines the disk and disk offset (block 730,
For this embodiment, the volume manager calculates the node in accordance with the arrangement illustrated in Table 2. The disks are apportioned by sectors, and the offset measures the number of sectors within a disk. The volume manager obtains blocks of data from the node, disk on the node and the offset within the disk (block 740,
The selected node determines, from the map manager, the storage node for the directory/file for the associated inode (block 830,
After obtaining the appropriate lock, the selected node transmits a file handle to the client (block 865,
The client transmits data, for the write operation, and the file handle (block 875,
Although the present invention has been described in terms of specific exemplary embodiments, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention.
This application claims the benefit of U.S. Provisional Patent Application No. 60/419,778, filed Oct. 17, 2002, entitled “A Distributed Storage System.”
Number | Date | Country
---|---|---
20040088297 A1 | May 2004 | US
Number | Date | Country
---|---|---
60419778 | Oct 2002 | US