Distributed computing architectures enable large computational, data storage, and retrieval operations to be performed by a number of different computers, thus reducing the time required to perform these operations. Distributed computing architectures are used for applications where the operations to be performed are complex, or where a large number of users are performing a large number of transactions using shared resources.
To reduce the costs of implementation and maintenance of distributed systems, low cost server devices commonly called blades are packaged together in a chassis to provide what is commonly called a blade server. Costs are reduced by minimizing the space occupied by the devices and by having the devices share power supplies and other resources. Each blade is designed to be a low-cost, field replaceable component.
It would be desirable to implement a distributed computing architecture, using blade servers, that is highly available and scalable, particularly for shared storage of high bandwidth real-time media data that is shared by a large number of users. However, providing high availability in a system with low-cost field replaceable components presents challenges.
A blade-based distributed computing system, for applications such as a storage network system, is made highly-available. The blade server integrates several computing blades and a blade for a switch that connects to the computing blades. Redundant components permit failover of operations from one component to its redundant component.
Configuration of one or more blade servers, such as assignment of high level network addresses to each blade, can be performed by a centralized process, called a configuration manager, on one blade in the system. A range of sequential high level network addresses is assigned to each blade server, and each blade server in turn assigns high level network addresses to its blades. The high level network address for each blade can be mapped to its chassis identifier and slot identifier. Configuration information also may include software version information and software upgrades. By distributing configuration information among the various components of one or more blade servers, configuration information can be accessed by any component that acts as the configuration manager.
Each blade server also may monitor its own blades to determine whether they are operational, to communicate status information and/or initiate recovery operations. With status and configuration information available for each blade, and a mapping of network addresses for each blade to its physical position (chassis identifier and slot identifier), this information may be presented in a graphical user interface. Such an interface may include a graphical representation of the blade servers which a user manipulates to view various information about each blade server and about each blade.
An application of such a blade-based system is for shared storage for high bandwidth real-time media data accessed by various client applications. In such an application, data may be divided into segments and distributed among storage blades according to a non-uniform pattern.
In such a system, it may be desirable to manage the quality of service between client applications and the blade servers. The switch in each blade server allocates sufficient bandwidth for a port for a client according to the bandwidth required by the client. The client may indicate its bandwidth requirements to the storage system by informing the catalog manager. The catalog manager can inform the switches of the bandwidth requirements of the different clients. A client may periodically update its bandwidth requirements.
Each computing unit 102 is a device with a nonvolatile computer-readable medium, such as a disk, on which data may be stored. The computing unit also has faster, typically volatile, memory into which data is read from the nonvolatile computer-readable medium. Each computing unit also has its own processing unit, independent of the processing units of the other computing units, and may execute its own operating system, such as an embedded operating system, e.g., Windows XP Embedded, Linux and VxWorks operating systems, and application programs. For example, the computing unit may be implemented as a server computer that responds to requests for access, including but not limited to read and write access, to data stored on its nonvolatile computer-readable medium in one or more data files in the file system of its operating system. A computing unit may perform other operations in addition to data storage and retrieval, such as a variety of data processing operations.
Client computers 104 also are computer systems that communicate with the computing units 102 over the computer network 106. Each client computer may be implemented using a general purpose computer that has its own nonvolatile storage and temporary storage, and its own processor for executing an operating system and application programs. Each client computer 104 may be executing a different set of application programs and/or operating systems.
An example application of the system shown in
The latency between a request to transfer data, and the actual transmission of that request by the network interface of one of the units in such a system can be reduced using techniques described in U.S. patent application Ser. No. ______ entitled “Transmit Request Management in a Distributed Shared Storage System”, by Mitch Kuninsky, filed on 21 Sep. 2006, based upon U.S. Provisional Patent Application Ser. No. 60/748,838, incorporated herein by reference.
In one embodiment of such a distributed, shared file system the data of each file is divided into segments. Redundancy information for each segment is determined, such as a copy of the segment. Each segment and its redundancy information are stored on the storage of different computing units. The computing unit on which a segment, and its redundancy information, is stored is selected according to any sequence of the computing units that provides a non-sequential distribution, such that the pattern of distribution is different from one file to the next and from a file to its redundancy information. For example, this sequence may be random, pseudorandom, quasi-random or a form of deterministic sequence, such as a permutation. An example distribution of copies of segments of data is shown in
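The placement scheme described above can be sketched as follows. This is a minimal illustration, not the described system's implementation; the function name, the use of the file identifier as a pseudorandom seed, and the segment and unit counts are all assumptions for the example.

```python
import random

def distribute_segments(file_id, num_segments, num_units):
    # Per-file pseudorandom sequence: seeding with the file identifier
    # makes the distribution pattern differ from one file to the next,
    # while remaining deterministic for a given file.
    rng = random.Random(str(file_id))
    placement = []
    for _ in range(num_segments):
        # Pick two distinct computing units so that a segment and its
        # redundancy information are never stored on the same unit.
        primary, copy = rng.sample(range(num_units), 2)
        placement.append((primary, copy))
    return placement
```

Because the sequence is deterministic per file, any component can recompute where a given segment resides without consulting a central table.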
The computing units 102 and computer network 106 shown in
Referring now to
Each chassis has a unique identifier among the chassis in the server system. This chassis identifier can be a permanent identifier that is assigned when the chassis is manufactured. Within the chassis, each physical position within the chassis is associated with a chassis position, called a slot identifier. This chassis position may be defined, for example, by hardwiring signals for each slot in the chassis, which are received by the blade when it is installed in the chassis. Thus, each blade can be uniquely identified by its slot identifier and the chassis identifier.
Because a blade typically does not have a display or keyboard, communication of information about the status of the blade is typically done through the network. However, if a blade is not functioning properly, communication from the blade may not occur. Even if communication did occur, it is difficult, using conventional network address assignment protocols such as Dynamic Host Configuration Protocol (DHCP), to determine the physical location of a blade given only its network address. In that case, the only way to find a blade is through its physical coordinates, which are a combination of the location of the chassis housing the blade (relative to other chassis in the same system) and the slot identifier for the blade in that chassis. Finding the location of a blade also is important during system development, system installation, service integration and other activities. Both switch blades and compute blades have unique slot identifiers within the chassis.
Accordingly, the network is preferably configured in a manner such that the slot identifier and chassis identifier for a blade (whether for a computing unit or a switch) can be determined from its network address. Such a configuration can be implemented such that all blades within a chassis are assigned addresses within a range of addresses that does not overlap with the range of addresses assigned to blades in other chassis. These network addresses may be sequential and assigned sequentially according to slot identifier. To provide high availability and automatic configurability, this configuration preferably is implemented automatically upon startup, reboot, replacement, addition or upgrade of a chassis or blade within a chassis. A table is maintained that tracks, for each pair of slot identifier and chassis identifier, the corresponding configuration information including the network address (typically an IP address) of the device, and optionally other information such as the time the device was configured, services available on the device, etc. A separate table associates the chassis position (relative to other chassis) and the chassis identifier. It is possible to create this association either manually or automatically, for example by integrating location tracking mechanisms such as a global positioning system (GPS) into the chassis. This configuration information may be stored in a blade in nonvolatile memory so as to survive a loss of power to the blade. The configuration information may be stored in each blade to permit any blade to act as a configuration manager, or to permit any configuration manager to access configuration information.
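Under the non-overlapping, sequential address scheme described above, the mapping between a blade's network address and its physical coordinates can be computed directly rather than looked up. The sketch below assumes, purely for illustration, a fixed number of slots per chassis and a numeric base address.

```python
SLOTS_PER_CHASSIS = 16  # assumed chassis capacity, for illustration

def slot_to_address(base, chassis_index, slot_id):
    # Each chassis owns a contiguous, non-overlapping address block;
    # addresses within the block are sequential by slot identifier.
    return base + chassis_index * SLOTS_PER_CHASSIS + slot_id

def address_to_slot(base, address):
    # Invert the mapping: recover (chassis_index, slot_id) so a failed
    # blade can be located physically given only its network address.
    return divmod(address - base, SLOTS_PER_CHASSIS)
```

In practice the system would also maintain the described tables (address, configuration time, services) keyed by these coordinates; the arithmetic here only captures the address layout.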
Referring now to
To initiate configuration of a multi-chassis installation, a user picks any one of the chassis and provides configuration information for the entire installation, including network address blocks, time, etc., to one of the switch blades. This selected switch blade then passes the configuration information to the configuration manager, a process executed on one of the switch blades. Any reasonable technique can be used to select a switch blade as the configuration manager. For example, upon startup each switch blade may transmit low level network messages, including its chassis identifier, to the other switch blades in the system. The switch with the lowest chassis identifier could be selected as the configuration manager. In one embodiment, the configuration manager may be designated manually through external user input. If the blade that is running the configuration manager is removed (which is possible because it is a field replaceable unit), another switch blade takes over the responsibility of the configuration manager. This is accomplished by having the configuration manager periodically send a message to the switch blades of other chassis indicating that it is operational. When the other switch blades determine that the configuration manager is not operational, another switch blade takes over the operation of the configuration manager.
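The lowest-chassis-identifier election described above can be sketched in a few lines. The function name and the status-map interface are illustrative assumptions; the point is only that the rule is deterministic, so every switch blade reaches the same conclusion independently.

```python
def elect_configuration_manager(switch_status):
    # switch_status maps a switch blade's chassis identifier to whether
    # it is currently considered operational (e.g., its periodic
    # "I am operational" messages are being received).
    live = [cid for cid, ok in switch_status.items() if ok]
    # The operational switch with the lowest chassis identifier wins;
    # if none are operational, no manager can be elected.
    return min(live) if live else None
```

When the current manager's periodic messages stop, each surviving switch re-runs the same rule over the remaining live switches and the same successor is chosen everywhere.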
The configuration manager may receive the chassis identifier of every chassis in the system from the switch blades in that chassis. Every switch blade may communicate to each other via a form of unicast or multicast protocol. The configuration manager may then order the chassis identifiers into a table, and assign each chassis a range of network addresses from the larger address block. This information may then be sent back to every switch blade in each chassis. The switch blade of a chassis receives the range of network addresses assigned to the chassis and assigns a network address to each of the blades in the chassis. The configuration manager ensures that each switch blade, and optionally each blade in each chassis, maintains a copy of the configuration information for the system.
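The range-assignment step performed by the configuration manager can be sketched as follows; the per-chassis block size and function signature are assumptions for the example, not details from the described system.

```python
def assign_address_ranges(chassis_ids, block_start, slots_per_chassis=16):
    # Order the reported chassis identifiers into a table and carve the
    # larger address block into one non-overlapping range per chassis.
    ranges = {}
    next_address = block_start
    for cid in sorted(chassis_ids):
        ranges[cid] = (next_address, next_address + slots_per_chassis - 1)
        next_address += slots_per_chassis
    return ranges
```

Each switch blade would then receive its chassis's range and hand out individual addresses within it, sequentially by slot identifier.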
Each chassis also may have a chassis manager that is an application that monitors the status of the blades and the applications running on the blades. There is a chassis manager in every chassis, but only one configuration manager in the entire installation. Both of these functions reside on the CPU within a switch blade. A process executed by the chassis manager will now be described in connection with
The type and complexity of the recovery procedure depends on the device or application being monitored. For example, if an application is not responding, the chassis manager may instruct the operating system for the blade that is executing that application to terminate that application's process and restart it. An operating system that has failed may cause the blade to be restarted. If a device with a corresponding redundant device has failed, the redundant device could be started. If failure of a hardware device is detected, a system administrator application could be notified of the failure.
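The escalation policy above amounts to a dispatch from failure type to recovery action. The failure and action names below are hypothetical labels invented for this sketch; the described system does not name them.

```python
def choose_recovery(failure_type):
    # Map each detected failure class to the recovery described above.
    actions = {
        "application_unresponsive": "terminate_and_restart_process",
        "operating_system_failed": "restart_blade",
        "redundant_device_available": "start_redundant_device",
        "hardware_failed": "notify_system_administrator",
    }
    # Unrecognized failures fall through to administrator notification,
    # a conservative default assumed for this sketch.
    return actions.get(failure_type, "notify_system_administrator")
```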
As a particular example of the operation of the chassis manager,
If a computing unit blade fails and needs to be replaced, the new computing blade is configured within the chassis when it is added, so that its network address is the same as that of the unit it replaced. The process by which it receives the network address is described above. Once the computing blade is restarted, its relevant applications and devices can begin sending status messages to the chassis manager on the switch blade.
Operations for managing failure and replacement of switch blades will now be described. The potential risk of a catastrophic failure of the server operation due to failure of a switch blade in a blade server is reduced by providing redundant switch blades. Using redundant switch blades ensures network connectivity to each computing blade and service continuity in spite of a switch blade failure. During normal operation, one of the switch blades is designated as the active chassis manager, whereas the other is designated as a passive chassis manager. Both switch blades still perform as switches, but only one of them is the active chassis manager. The switches in a chassis are connected via redundant serial or Ethernet control paths, to monitor activity of each other, as well as to exchange installation configuration information with each other. One of the switches in the blade server assumes the role of the active switch, for example, if it has the most current configuration data, or if it has a lower slot identifier. When a switch blade is replaced, the new switch typically does not have the most current configuration data. In that case, it receives the configuration data from the chassis manager, as well as from other switch blades that comprise the redundant switch network.
During normal operation, the chassis manager executes on one switch blade CPU and monitors status messages from the passive chassis manager on the other switch blade. If failure of a passive chassis manager is detected, the active chassis manager attempts to restart the switch blade or can communicate its failure condition.
Also during normal operation, the passive chassis manager monitors status messages from the switch blade with the active chassis manager.
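The mutual monitoring between the active and passive chassis managers reduces to tracking the peer's last status message against a staleness threshold. A minimal sketch, assuming an explicit clock value and a timeout chosen purely for illustration:

```python
class PeerMonitor:
    # Tracks status messages from the peer chassis manager. If messages
    # stop arriving for longer than the timeout, the peer is treated as
    # failed: the passive manager takes over, or the active manager
    # attempts a restart of the passive peer, as described above.
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_heartbeat = None

    def record_heartbeat(self, now):
        self.last_heartbeat = now

    def peer_failed(self, now):
        if self.last_heartbeat is None:
            return True
        return (now - self.last_heartbeat) > self.timeout
```

Passing the clock in explicitly keeps the sketch deterministic; a real implementation would read a monotonic clock and run this check periodically.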
Another area in which high availability can be provided is in the upgrading of software of a blade. Each blade (whether a computing unit blade or a switch blade) maintains in nonvolatile memory a current, valid configuration table identifying the firmware, including a boot loader, an operating system, and applications to be loaded. A shadow copy of this table is maintained. Additionally, shadow copies of the firmware, operating system and applications are maintained.
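The current-plus-shadow configuration table described above supports upgrades that can be rolled back. The staging and commit interface below is an assumption made for this sketch; only the invariant (a known-good table always survives a failed upgrade) comes from the description.

```python
class FirmwareConfig:
    # Maintains the current, valid configuration table alongside a
    # shadow copy. An upgrade is staged into the shadow and promoted
    # only once validated, so a failed upgrade falls back to the last
    # known-good table.
    def __init__(self, table):
        self.current = dict(table)
        self.shadow = dict(table)

    def stage_upgrade(self, updates):
        # Apply upgrades to the shadow only; current stays untouched.
        self.shadow = {**self.current, **updates}

    def commit(self, validated):
        if validated:
            self.current = dict(self.shadow)   # promote the upgrade
        else:
            self.shadow = dict(self.current)   # roll back to known-good
        return self.current
```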
As these operations demonstrate, each blade server monitors its own blades to determine whether they are operational, to communicate status information and/or to initiate recovery operations. With status and configuration information available for each blade, and with the mapping of network addresses for each blade to its physical position (chassis identifier and slot identifier), this information may be presented in a graphical user interface. Such an interface may include a graphical representation of the blade servers which a user manipulates to view various information about each blade server and about each blade.
The foregoing system is particularly useful in implementing a highly available, blade based distributed, shared file system for supporting high bandwidth temporal media data, such as video and audio data, that is captured, edited and played back in an environment with a large number of users. Because the topology of the network can be derived from the network addresses, this information can be used to partition use of the blade servers to provide various performance enhancements. For example, high resolution material can be segregated from low resolution material based upon networking topology and networking bottlenecks, which in turn will segregate network traffic from different clients into different parts of the network. In such an application, data may be divided into segments and distributed among storage blades according to a non-uniform pattern within the set of storage blades designated for each type of content.
In such a system, it may be desirable to manage the quality of service between client applications and the blade servers. The switch in each blade server allocates sufficient bandwidth or buffering for a port for a client according to the bandwidth required by the client. The client may indicate its bandwidth or burstiness requirements to the storage system by informing the catalog manager. The catalog manager can inform the switches of the bandwidth or burstiness requirements of the different clients. A client may periodically update its bandwidth or burstiness requirements.
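The flow described above, in which clients report requirements to the catalog manager and the catalog manager propagates them to the switches, can be sketched as follows. The class and method names are invented for this illustration and do not reflect the described system's interfaces.

```python
class Switch:
    # Records the per-client bandwidth to allocate on the client's port.
    def __init__(self):
        self.port_bandwidth = {}

    def allocate(self, client_id, mbps):
        self.port_bandwidth[client_id] = mbps

class CatalogManager:
    # Central point where clients register bandwidth requirements;
    # it informs every switch so each can allocate port bandwidth.
    def __init__(self):
        self.requirements = {}
        self.switches = []

    def register_switch(self, switch):
        self.switches.append(switch)

    def update_bandwidth(self, client_id, mbps):
        # Clients may call this periodically as their needs change.
        self.requirements[client_id] = mbps
        for switch in self.switches:
            switch.allocate(client_id, mbps)
```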
Having now described an example embodiment, it should be apparent to those skilled in the art that the foregoing is merely illustrative and not limiting, having been presented by way of example only. Numerous modifications and other embodiments are within the scope of one of ordinary skill in the art and are contemplated as falling within the scope of the invention.
This application claims the benefit of priority to U.S. provisional patent application Ser. No. 60/720,152 entitled “Highly-Available Blade-Based Distributed Computing System” filed 23 Sep. 2005, 60/748,839 having the same title filed 9 Dec. 2005, and 60/748,840 entitled “Distribution of Data in a Distributed Shared Storage System” filed 9 Dec. 2005. This application is related to non-provisional patent application Ser. No. ______ entitled “Distribution of Data in a Distributed Shared Storage System” and Ser. No. ______ entitled “Transmit Request Management in a Distributed Shared Storage System”, both filed 21 Sep. 2006. The contents of all of the aforementioned applications are incorporated herein by reference.
Number | Date | Country
---|---|---
60720152 | Sep 2005 | US
60748840 | Dec 2005 | US
60748839 | Dec 2005 | US