This invention relates, in general, to data sharing in a communications environment, and in particular, to dynamically managing one or more clusters of nodes to enable the sharing of data.
Clustering involves the grouping of a plurality of nodes, which share resources and collaborate with one another to perform various tasks, into one or more clusters. Clustering is used for various purposes, including parallel processing, load balancing and fault tolerance. A cluster may include any number of nodes.
Advances in technology have affected the size of clusters. For example, the evolution of storage area networks (SANs) has produced clusters with large numbers of nodes. Each of these clusters has a fixed, known set of nodes with known network addressability, as well as a common system management, common user domains and other characteristics resulting from the static environment.
The larger the cluster, typically, the more difficult it is to manage. This is particularly true when a cluster is created as a super-cluster that includes multiple sets of resources. Such a super-cluster is managed as a single large cluster of thousands of nodes. Not only is management of such a cluster difficult, but such centralized management may also fail to meet the needs of one or more sets of nodes within the super-cluster.
Thus, a need exists for a capability that facilitates management of clusters. As one example, a need exists for a capability that enables creation of a cluster and the dynamic joining of nodes to that cluster to perform a specific task.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method of managing clusters of a communications environment. The method includes, for instance, obtaining a cluster of nodes, the cluster of nodes comprising one or more nodes of a data owning cluster; and dynamically joining the cluster of nodes by one or more other nodes to access data owned by the data owning cluster.
System and computer program products corresponding to the above-summarized method are also described and claimed herein.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
The subject matter, which is regarded as the invention, is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
In accordance with an aspect of the present invention, clusters are dynamically provided to enable data access. As one example, an active cluster is formed, which includes one or more nodes from at least one data owning cluster and one or more nodes from at least one data using cluster. A node of a data using cluster dynamically joins the active cluster, in response to, for instance, a request by the node for data owned by a data owning cluster. A successful join enables the data using node to access data of the data owning cluster, assuming proper authorization.
One example of a cluster configuration is depicted in
Nodes 102 are also coupled to a storage area network (SAN) 106, which further couples the nodes to one or more storage media 108. The storage media include, for instance, disks or other types of storage devices. The storage media include files having data to be accessed. A collection of files is referred to herein as a file system, and there may be one or more file systems in a given cluster.
A file system is managed by a file system manager node 110, which is one of the nodes of the cluster. The same file system manager may manage one or more of the file systems of the cluster, each file system may have its own file system manager, or any combination thereof may be used. Also, in a further embodiment, more than one file system manager may be selected to manage a particular file system.
An alternate cluster configuration is depicted in
The data flow between the server nodes and the communications nodes is the same as if the storage media were addressed directly, although the performance and/or syntax may be different. As examples, the data flow of
In accordance with an aspect of the present invention, one cluster may be coupled to one or more other clusters, while still maintaining separate administrative and operational domains for each cluster. For instance, as depicted in
Each of the clusters is maintained separately, allowing individual administrative policies to prevail within a particular cluster. This is in contrast to merging the clusters, and thus the resources of the clusters, into a single administrative and operational domain. The separate clusters facilitate management and provide greater flexibility.
Additional clusters may also be coupled to one another, as depicted in
Although in each of the clusters described above five nodes are depicted, this is only one example. Each cluster may include one or more nodes and each cluster may have a different number or the same number of nodes as another cluster.
In accordance with an aspect of the present invention, a cluster may be at least one of a data owning cluster, a data using cluster and an active cluster. A data owning cluster is a collection of nodes, which are typically, but not necessarily, co-located with the storage used for at least one file system owned by the cluster. The data owning cluster controls access to the one or more file systems, performs management functions on the file system(s), controls the locking of the objects which comprise the file system(s) and/or is responsible for a number of other central functions.
The data owning cluster is a collection of nodes that share data and have a common management scheme. As one example, the data owning cluster is built out of the nodes of a storage area network, which provides a mechanism for connecting multiple nodes to the same storage media and providing management software therefor.
As one example, a file system owned by the data owning cluster is implemented as a SAN file system, such as a General Parallel File System (GPFS), offered by International Business Machines Corporation, Armonk, N.Y. GPFS is described in, for instance, “GPFS: A Parallel File System,” IBM Publication No. SG24-5165-00 (May 7, 1998), which is hereby incorporated herein by reference in its entirety.
Applications can run on the data owning clusters. Further, the user id space of the owning cluster is the user id space that is native to the file system and stored within the file system.
A data using cluster is a set of one or more nodes which desires access to data owned by one or more data owning clusters. The data using cluster runs applications that use data available from one or more owning clusters. The data using cluster has configuration data available to it directly or through external directory services. This data includes, for instance, a list of file systems which might be available to the nodes of the cluster, a list of contact points within the owning cluster to contact for access to the file systems, and a set of credentials which allow access to the data. In particular, the data using cluster is configured with sufficient information to start the file system code and a way of determining the contact point for each file system that might be desired. The contact points may be defined using an external directory service or be included in a list within a local file system of each node. The data using cluster is also configured with security credentials which allow each node to identify itself to the data owning clusters.
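By way of illustration only, the following sketch shows one way the configuration data described above might be represented on a data using cluster; the structure and field names are hypothetical and are not part of the present invention.

```python
# Illustrative sketch of data using cluster configuration; names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class FileSystemEntry:
    name: str                   # name of a file system that might be available
    owning_cluster: str         # data owning cluster that contains the file system
    contact_points: List[str]   # contact points within the owning cluster

@dataclass
class DataUsingClusterConfig:
    file_systems: List[FileSystemEntry] = field(default_factory=list)
    directory_service: Optional[str] = None   # external directory service, if used instead of a list
    credentials: Dict[str, str] = field(default_factory=dict)  # security credentials per owning cluster

config = DataUsingClusterConfig(
    file_systems=[FileSystemEntry("fs1", "EastCluster", ["east-a", "east-b"])],
    credentials={"EastCluster": "public-key-credential"},
)
```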
An active cluster includes one or more nodes from at least one data owning cluster, in addition to one or more nodes from at least one data using cluster that have registered with the data owning cluster. For example, the active cluster includes nodes (and related resources) that have data to be shared and those nodes registered to share data of the cluster.
A node of a data using cluster can be part of multiple active clusters, and a cluster can concurrently be a data owning cluster for one file system and a data using cluster for other file systems. Just as a data using cluster may access data from multiple data owning clusters, a data owning cluster may serve multiple data using clusters. This allows dynamic creation of active clusters to perform a job using the compute resources of multiple data using clusters. The job scheduling facility selects the nodes, from a larger pool, that will cooperate in running the job. The capability of an assigned job to cause a node to join the active cluster for the required data, using the best available path to that data, provides a highly flexible tool for running large data centers.
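As a non-limiting illustration, the following sketch shows how a job scheduling facility might select nodes from a larger pool and have each selected node dynamically join the active cluster for the file systems the job requires; the function and attribute names are assumed for purposes of the example.

```python
# Illustrative sketch only; join_active_cluster is an assumed method on a node object.
def schedule_job(job, compute_pool, nodes_needed):
    selected = compute_pool[:nodes_needed]        # select nodes from the larger pool
    for node in selected:
        for fs_name in job["file_systems"]:       # the data the job requires
            node.join_active_cluster(fs_name)     # dynamic join driven by the job's data needs
    return selected                               # these nodes cooperate in running the job
```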
Examples of active clusters are depicted in
Similarly, an Active Cluster 2 (506) includes a plurality of nodes from West cluster 508 that control one or more file systems and a plurality of data using nodes from North cluster 504. Node C of North Cluster 504 is part of Active Cluster 1, as well as Active Cluster 2. Although in these examples, all of the nodes of West Cluster and East Cluster are included in their respective active clusters, in other examples, less than all of the nodes are included.
The nodes that are part of a non-data owning cluster are in an active cluster for the purpose of performing specific work at a particular point in time. North nodes A and B could be in Active Cluster 2 at a different point in time doing different work. Note that the West nodes could also join Active Cluster 1, if the compute requirements include access to data on the East cluster. Many other variations are possible.
In yet another configuration, a compute pool 600 (
In order to form active clusters, the data owning and data using clusters are to be configured. Details associated with configuring such clusters are described with reference to
Referring to
Further, in this example, one or more file systems to be owned by the cluster are also installed. These file systems include the data to be shared by the nodes of the various clusters. In one example, the file systems are the General Parallel File Systems (GPFS), offered by International Business Machines Corporation. One or more aspects of GPFS are described in “GPFS: A Parallel File System,” IBM Publication No. SG24-5165-00 (May 7, 1998), which is hereby incorporated herein by reference in its entirety, and in various patents/publications, including, but not limited to, U.S. Pat. No. 6,708,175 entitled “Program Support For Disk Fencing In A Shared Disk Parallel File System Across Storage Area Network,” Curran et al., issued Mar. 16, 2004; U.S. Pat. No. 6,032,216 entitled “Parallel File System With Method Using Tokens For Locking Modes,” Schmuck et al., issued Feb. 29, 2000; U.S. Pat. No. 6,023,706 entitled “Parallel File System And Method For Multiple Node File Access,” Schmuck et al, issued Feb. 8, 2000; U.S. Pat. No. 6,021,508 entitled “Parallel File System And Method For Independent Metadata Loggin,” Schmuck et al., issued Feb. 1, 2000; U.S. Pat. No. 5,999,976 entitled “Parallel File System And Method With Byte Range API Locking,” Schmuck et al., issued Dec. 7, 1999; U.S. Pat. No. 5,987,477 entitled “Parallel File System And Method For Parallel Write Sharing,” Schmuck et al., issued Nov. 16, 1999; U.S. Pat. No. 5,974,424 entitled “Parallel File System And Method With A Metadata Node,” Schmuck et al., issued Oct. 26, 1999; U.S. Pat. No. 5,963,963 entitled “Parallel File System And Buffer Management Arbitration,” Schmuck et al., issued Oct. 5, 1999; U.S. Pat. No. 5,960,446 entitled “Parallel File System And Method With Allocation Map,” Schmuck et al., issued Sep. 28, 1999; U.S. Pat. No. 5,950,199 entitled “Parallel File System And Method For Granting Byte Range Tokens,” Schmuck et al., issued Sep. 7, 1999; U.S. Pat. No. 5,946,686 entitled “Parallel File System And Method With Quota Allocation,” Schmuck et al., issued Aug. 31, 1999; U.S. Pat. No. 5,940,838 entitled “Parallel File System And Method Anticipating Cache Usage Patterns,” Schmuck et al., issued Aug. 17, 1999; U.S. Pat. No. 5,893,086 entitled “Parallel File System And Method With Extensible Hashing,” Schmuck et al., issued Apr. 6, 1999; U.S. Patent Application Publication No. 20030221124 entitled “File Level Security For A Metadata Controller In A Storage Area Network,” Curran et al., published Nov. 27, 2003; U.S. Patent Application Publication No. 20030220974 entitled “Parallel Metadata Service In Storage Area Network Environment,” Curran et al., published Nov. 27, 2003; U.S. Patent Application Publication No. 20030018785 entitled “Distributed Locking Protocol With Asynchronous Token Prefetch And Relinquish,” Eshel et al., published Jan. 23, 2003; U.S. Patent Application Publication No. 20030018782 entitled “Scalable Memory Management Of Token State For Distributed Lock Managers,” Dixon et al., published Jan. 23, 2003; and U.S. Patent Application Publication No. 20020188590 entitled “Program Support For Disk Fencing In A Shared Disk Parallel File System Across Storage Area Network,” Curran et al., published Dec. 12, 2002, each of which is hereby incorporated herein by reference in its entirety.
Although the use of file systems is described herein, in other embodiments, the data to be shared need not be maintained as file systems. Instead, the data may merely be stored on the storage media or stored as a structure other than a file system.
Subsequent to installing the data owning cluster and file systems, the data owning cluster, also referred to as the home cluster, is configured with authorization and access controls for nodes wishing to join an active cluster of which the data owning cluster is a part, STEP 802. For example, for each file system, a definition is provided specifying whether the file system may be accessed outside the owning cluster. If it may be accessed externally, then an access list of nodes or a set of required credentials is specified. As one example, a pluggable security infrastructure is implemented using public key authentication. Other security mechanisms can also be plugged in. This concludes installation of the data owning cluster.
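One possible representation of this per-file-system authorization data, provided solely as an example with assumed names and values, is sketched below.

```python
# Illustrative sketch of per-file-system access controls on the data owning (home) cluster.
EXPORTS = {
    "fs1": {
        "external_access": True,                       # may be accessed outside the owning cluster
        "allowed_nodes": {"north-a", "north-c"},       # access list of nodes, and/or
        "required_credentials": {"north-public-key"},  # a set of required credentials
    },
    "fs2": {"external_access": False},                 # not accessible externally
}

def may_access(file_system, node, credential):
    entry = EXPORTS.get(file_system)
    if entry is None or not entry.get("external_access", False):
        return False
    return (node in entry.get("allowed_nodes", set())
            or credential in entry.get("required_credentials", set()))
```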
One embodiment of the logic associated with installing a data using cluster is described with reference to
Initially, file system code is installed and local configuration selections are made, STEP 900. For instance, there are various parameters pertaining to network and memory configuration which are used to install the data using cluster before it accesses data. The file system code is installed by, for instance, an administrator using the native facilities of the operating system; for example, rpm is used on Linux. Certain parameters which apply to the local node are specified. These parameters include, for instance, which networks are available and what memory can be allocated, among others.
Thereafter, a list of available file systems and contact nodes of the owning clusters is created, or the name of a resource directory is configured, STEP 902. In particular, there are two ways of finding the file system resources that are applicable to the data using cluster: a system administrator explicitly configures the list of available file systems and where to find them, or a directory is created at a known place, which may be accessed by presenting the name of the file system that the application is requesting and receiving back a contact point for it. The list includes, for instance, a name of the file system, the cluster that contains that file system, and one or more contact points for the cluster.
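The following sketch illustrates, with assumed helper names, the two ways described above of resolving a desired file system name to its owning cluster and contact points.

```python
# Illustrative sketch; the directory_service object and its lookup method are assumptions.
def resolve_file_system(fs_name, configured_list, directory_service=None):
    for entry in configured_list:                     # explicitly configured list
        if entry["file_system"] == fs_name:
            return entry["cluster"], entry["contact_points"]
    if directory_service is not None:                 # otherwise, query the resource directory
        return directory_service.lookup(fs_name)
    raise LookupError("no contact point known for file system " + fs_name)
```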
In addition to the above, a user translation program is configured, STEP 904. For instance, the user translation program is identified by, for example, a system administrator (e.g., a pointer to the program is provided). The translation program translates a local user id to a user id of the data owning cluster. This is described in further detail below. In another embodiment, a translation is not performed, since a user's identity is consistent everywhere.
Additionally, security credentials are configured by, for instance, a system administrator, for each data owning (or home) cluster to which access is possible, STEP 906. Security credentials may include the providing of a key. Further, each network has its own set of rules as to whether security is permissible. Ultimately, however, the question resolves to: prove that I am who I say I am, or trust that I am who I say I am.
Subsequent to installing the one or more data owning clusters and the one or more data using clusters, those clusters may be used to access data. One embodiment of the logic associated with accessing data is described with reference to
After mount processing or if the file system has previously been mounted, then a further determination is made as to whether the lease for the storage medium (e.g., disk) having the desired file is valid, INQUIRY 1006. That is, access to the data is controlled by establishing leases for the various storage media storing the data to be accessed. Each lease has an expiration parameter (e.g., date and/or time) associated therewith, which is stored in memory of the data using node. To determine whether the lease is valid, the data using node checks the expiration parameter. Should the lease be invalid, then a retry is performed, if allowed, or an error is presented, if not allowed, STEP 1008. On the other hand, if the lease is valid, then the data is served to the application, assuming the user of the application is authorized to receive the data, STEP 1010.
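A simplified sketch of this lease check, using assumed names and data structures, follows.

```python
# Illustrative sketch of the lease validity check on the data using node.
import time

def serve_data(lease, retry_allowed, read_data):
    if time.time() < lease["expires_at"]:      # INQUIRY 1006: lease still valid
        return read_data()                     # STEP 1010: serve the data (given authorization)
    if retry_allowed:                          # STEP 1008: retry if allowed,
        return None                            # caller re-drives the request
    raise PermissionError("storage medium lease is no longer valid")  # otherwise present an error
```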
Authorization of the user includes translating the user identifier of the request from the data using node to a corresponding user identifier at the data owning cluster, and then checking authorization of that translated user identifier. One embodiment of the logic associated with performing the authorization is described with reference to
Initially, an application on the data using node opens a file and the operating system credentials present a local user identifier, STEP 1100. The local identifier on the using node is converted to the identifier at the data owning cluster, STEP 1102. As one example, a translation program executing on the data using node is used to make the conversion. The program includes logic that accesses a table to convert the local identifier to the user identifier at the owning cluster.
One example of a conversion table is depicted below:

    User ID at        User ID at        User Name at      User Name at
    Using Cluster     Owning Cluster    Using Cluster     Owning Cluster
    8765              5678              sally             sjones
The table is created by a system administrator, in one example, and includes various columns, such as a user identifier at the using cluster and a user identifier at the owning cluster, as well as a user name at the using cluster and a user name at the owning cluster. Typically, it is the user name that is provided, which is then associated with a user id. As one example, a program invoked by Sally on a node in the data using cluster creates a file. If the file is created in local storage, then it is assigned to be owned by user id 8765, representing Sally. However, if the file is created in shared storage, it is created using user id 5678, representing Sjones. If Sally tries to access an existing file, the file system is presented with user id 8765. The file system invokes the conversion program and is provided with id 5678.
Subsequent to converting the local identifier to the identifier at the data owning cluster, a determination is made as to whether the converted identifier is authorized to access the data, STEP 1104. This determination may be made in many ways, including by checking an authorization table or other data structure. If the user is authorized, then the data is served to the requesting application.
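The following sketch illustrates the translation and authorization steps just described; the table contents mirror the Sally/Sjones example above, and the authorization set is purely illustrative.

```python
# Illustrative sketch of user id translation and authorization checking.
CONVERSION_TABLE = {
    8765: 5678,   # local id for sally maps to the id for sjones at the owning cluster
}

AUTHORIZED_IDS = {5678}   # e.g., an authorization table kept for the owning cluster

def check_access(local_uid):
    owning_uid = CONVERSION_TABLE.get(local_uid)   # STEP 1102: convert the local identifier
    if owning_uid is None or owning_uid not in AUTHORIZED_IDS:
        return None                                # STEP 1104 failed: access is denied
    return owning_uid                              # authorized: the data may be served
```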
Data access can be performed by direct paths to the data (e.g., via a storage area network (SAN), a SAN enhanced with a network connection, or a software simulation of a SAN using, for instance, Virtual Shared Disk, offered by International Business Machines Corporation); or by using a server node, if the node does not have an explicit path to the storage media, as examples. In the latter, the server node provides a path to the storage media.
During the data service, the file system code of the data using node reads from and/or writes to the storage media directly after obtaining appropriate locks. The file system code local to the application enforces authorization by translating the user id presented by the application to a user id in the user space of the owning cluster, as described herein. Further details regarding data flow and obtaining locks are described in the above-referenced patents/publications, each of which is hereby incorporated herein by reference in its entirety.
As described above, in order to access the data, the file system that includes the data is to be mounted. One embodiment of the logic associated with mounting the file system is described with reference to
Referring to
Subsequent to determining the contact nodes, a request is sent to a contact node requesting the address of the file system manager for the desired file system, STEP 1204. If the particular contact node to which the request is sent does not respond, an alternate contact node may be used. By definition, a contact node that responds knows how to access the file system manager.
In response to receiving a reply from the contact node with the identity of the file system manager, a request is sent to the file system manager requesting mount information, STEP 1206. The request includes any required security credentials, and the information sought includes the details the data using node needs to access the data. For instance, it includes a list of the storage media (e.g., disks) that make up the file system and the rules that are used in order to access the file system. As one example, a rule may specify that, for this kind of file system, permission to access the file system is to be sought every X amount of time. Many other rules may also be used.
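A sketch of this portion of the mount sequence on the data using node is shown below; the two request helpers are passed in as parameters because their actual form is implementation specific.

```python
# Illustrative sketch of STEPs 1204 and 1206 on the data using node.
def mount_remote_file_system(fs_name, contact_nodes, credentials,
                             ask_for_fs_manager, request_mount_info):
    fs_manager = None
    for contact in contact_nodes:                 # STEP 1204: ask a contact node for the
        try:                                      # file system manager's address
            fs_manager = ask_for_fs_manager(contact, fs_name)
            break
        except ConnectionError:
            continue                              # try an alternate contact node
    if fs_manager is None:
        raise RuntimeError("no contact node responded for " + fs_name)
    # STEP 1206: request mount information (storage media list, access rules), with credentials
    return request_mount_info(fs_manager, fs_name, credentials)
```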
Further details regarding the logic associated with the file system manager processing the mount request are described with reference to
In one embodiment, the file system manager accepts mount requests from a data using node, STEP 1300. In response to receiving the request, the file system manager takes the security credentials from the request and validates the security credentials of the data using node, STEP 1302. This validation may include public key authentication, checking a validation data structure (e.g., table), or other types of security validation. If the credentials are approved, the file system manager returns to the data using node a list of one or more servers for the needed or desired storage media, STEP 1304. It also returns, in this example, for each storage medium, a lease for standard lease time. Additionally, the file system manager places the new data using node on the active cluster list and notifies other members of the active cluster of the new node.
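A corresponding sketch of the file system manager's handling of the mount request, with assumed data structures and an illustrative lease length, follows.

```python
# Illustrative sketch of STEPs 1300-1304 on the file system manager.
import time

STANDARD_LEASE_SECONDS = 35   # illustrative value only; the standard lease time is configurable

def handle_mount_request(request, valid_credentials, disk_servers, active_cluster, notify_members):
    if request["credentials"] not in valid_credentials:     # STEP 1302: validate security credentials
        raise PermissionError("mount request denied")
    leases = {disk: time.time() + STANDARD_LEASE_SECONDS    # a lease for each needed storage medium
              for disk in disk_servers}
    active_cluster.add(request["node"])                      # place the new node on the active cluster list
    notify_members(active_cluster, request["node"])          # notify other members of the active cluster
    return {"servers": disk_servers, "leases": leases}       # STEP 1304: return server list and leases
```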
Returning to
The data using node mounts the file system using the received information and disk paths, allowing access by the data using node to data owned by the data owning cluster, STEP 1210. As an example, a mount includes reading each disk in the file system to ensure that the disk descriptions on the disks match those expected for this file system, in addition to setting up the local data structures used to translate user file requests to disk blocks on the storage media. Further, the leases for the file system are renewed as indicated by the file system manager. Additionally, locks and disk paths are released if there is no activity for a period of time specified by the file system manager.
Subsequent to successfully mounting the file system on the data using node, a heartbeating protocol, referred to as a storage medium (e.g., disk) lease, is begun. The data using node requests permission to access the file system for a period of time and is to renew that lease prior to its expiration. If the lease expires, no further I/O is initiated. Additionally, if no activity occurs for a period of time, the using node puts the file system into a locally suspended state, releasing the resources held for the mount both locally and on the data owning cluster. Another mount protocol is executed, if activity resumes.
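One simplified way the lease renewal and inactivity behavior described above might be structured is sketched below; the helper callables are assumptions.

```python
# Illustrative sketch of the storage medium lease ("heartbeat") loop on the data using node.
import time

def lease_loop(renew_lease, lease_period, has_recent_activity, suspend_mount):
    expires_at = time.time() + lease_period
    while True:
        time.sleep(lease_period / 2)         # attempt renewal well before expiration
        if not has_recent_activity():
            suspend_mount()                  # release resources held locally and on the owning cluster
            return
        if time.time() >= expires_at:
            return                           # lease expired: no further I/O is initiated
        if renew_lease():                    # ask the data owning cluster for more time
            expires_at = time.time() + lease_period
```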
One example of maintaining a lease is described with reference to
Further details regarding disk leasing are described in U.S. patent application Ser. No. 10/154,009 entitled “Parallel Metadata Service In Storage Area Network Environment,” Curran et al., filed May 23, 2002, and U.S. Pat. No. 6,708,175 entitled “Program Support For Disk Fencing In A Shared Disk Parallel File System Across Storage Area Network,” Curran et al., issued Mar. 16, 2004, each of which is hereby incorporated herein by reference in its entirety.
In accordance with an aspect of the present invention, if all of the file systems used by a data using node are unmounted, INQUIRY 1500 (
Described in detail above is a capability in which one or more nodes of a data using cluster may dynamically join one or more nodes of a data owning cluster for the purposes of accessing data. By registering the data using cluster (at least a portion thereof) with the data owning cluster (at least a portion thereof), an active cluster is formed. A node of a data using cluster may access data from multiple data owning clusters. Further, a data owning cluster may serve multiple data using clusters. This allows dynamic creation of active clusters to perform a job using the compute resources of multiple data using clusters.
In accordance with an aspect of the present invention, nodes of one cluster can directly access data (e.g., without copying the data) of another cluster, even if the clusters are geographically distant (e.g., even in other countries).
Advantageously, one or more capabilities of the present invention enable the separation of data using clusters and data owning clusters; allow each cluster to retain its own administration and policies while a data using cluster is part of multiple active clusters; provide the ability to dynamically join an active cluster and to leave that cluster when active use of the data is no longer desired; and provide the ability of a node which has joined the active cluster to participate in the management of metadata.
A node of the data using cluster may access multiple file systems for multiple locations by simply contacting the data owning cluster for each file system desired. The data using cluster node provides appropriate credentials to the multiple file systems and maintains multiple storage media leases. In this way, it is possible for a job running at location A to use data, which resides at locations B and C, as examples.
As used herein, a node is a machine; device; computing unit; computing system; a plurality of machines, computing units, etc. coupled to one another; or anything else that can be a member of a cluster. A cluster of nodes includes one or more nodes. The obtaining of a cluster includes, but is not limited to, having a cluster, receiving a cluster, providing a cluster, forming a cluster, etc.
Further, the owning of data refers to owning the data, one or more paths to the data, or any combination thereof. The data can be stored locally or on any type of storage media. Disks are provided herein as only one example.
Although examples of clusters have been provided herein, many variations exist without departing from the spirit of the present invention. For example, different networks can be used, including less reliable networks, since faults are tolerated. Many other variations also exist.
The capabilities of one or more aspects of the present invention can be implemented in software, firmware, hardware or some combination thereof.
One or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has therein, for instance, computer readable program code means or logic (e.g., instructions, code, commands, etc.) to provide and facilitate the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
Although preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions and the like can be made without departing from the spirit of the invention and these are therefore considered to be within the scope of the invention as defined in the following claims.