With the evolution and proliferation of computer systems and computer networks, modern users have come to rely on technical systems that were once thought of as luxuries. Email, chat, online sales, data access, and other related data services have become part of the daily routine of millions of users. As such, reliable data service with 24-hour access has become expected and relied upon by Internet users across the globe.
As a result of the tremendous pressure placed on companies to deliver reliable data services, many strategies have been implemented to assure continuous access such as data mirror sites, multiple redundant systems, clustered computing systems, and the like. In particular, clustered computing systems are being utilized by many data service providers for critical services. Clustered computing systems may be created by connecting two or more computers together in such a way that they behave like a single computer. Clustering may be used for parallel processing, load balancing, and fault tolerance. Clustering is a popular strategy for implementing parallel processing applications because it enables companies to leverage an investment already made in PCs and workstations. In addition, it's relatively easy to add new CPUs simply by adding a new PC to the network.
In the past, some companies utilized only a handful of computers executing relatively simple software. These early systems were relatively simple to manage especially when confronting and isolating problems. In the present networked computing environments and particularly in clustered systems, however, information systems can contain hundreds of interdependent servers and applications. Failure in one of these components can potentially cause a cascade of failures that could bring down one or more servers leaving providers susceptible to catastrophic data losses. One category of problem that is particularly troublesome for computing system administrators is a single point failure. A single point failure is a failure occurring at one point in a system that results in catastrophic failure of the entire system. Avoiding single point failures (along with other types of failures) by testing various configurations of clustered computing systems may, therefore, be desirable.
One problem encountered in maintaining clustered computing systems to avoid failures is the dizzying array of interactions presented by modern clustered computing systems. For example, a two-node cluster having at least four operational conditions (i.e., hardware/software constraints and requirements) may present as many as 8000 different possible configurations to a user. Testing and qualifying each of these thousands of configurations may quickly become infeasible due to time and resource constraints. The problem is exacerbated when those configurations are tested against an array of failure events.
In light of the foregoing, methods and systems for forecasting status of clustered computing systems are presented herein.
The invention provides methods of forecasting functionality for clustered computing configurations that may be deployed across computer network systems and environments that may function in conjunction with a wide range of hardware and software configurations.
An exemplary method of forecasting a forecast status of a clustered computing system is presented including: creating a current status model of the clustered computing system based on a start data set; applying an event input set to the current status model; and creating a forecast status based on applying the event input set to the current status model. In some embodiments, the current status model may be represented by: a configured operational status, a current operational status, and a projected operational status of the clustered computing system. In some embodiments, the above steps of applying an event input set and creating a forecast status may be repeated such that a plurality of event input sets may be tested. In some embodiments, the start data set includes: an application package information data set; a node information data set; a dependency information data set; and a priority information data set. In some embodiments, the dependency information data set includes: a same node exclusion dependency, an all node exclusion dependency, a same node up dependency, an any node up dependency, and a different node up dependency. In some embodiments, the event input set includes: a hardware failure, a hardware addition, a node failure, a node addition, an application package failure, an application package addition, a network failure, a package services failure, a shutdown, and a reboot.
Embodiments of the invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:
The present invention will now be described in detail with reference to a few embodiments herein as illustrated in the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some or all of these specific details. In other instances, well known process steps and/or structures have not been described in detail in order to not unnecessarily obscure the present invention.
In accordance with embodiments of the present invention, there are provided methods and systems for forecasting operational status of clustered computing systems. Embodiments of the present invention allow a user to test configurations and event scenarios in clustered computing systems.
Referring to
Further, internet cloud 104 is merely a simplified illustration representing any number of network resources configured to maintain a linkage between users and clustered computing systems that provide services for users. Internet cloud 104 may represent, for example, a LAN, a WAN, or the Internet without limitation. As noted above, data communication links 120 may provide interconnection between clusters, between clusters and internets, and between internets and clients. That is, data communication links 120 may connect internet cloud 104 with a single user 124 or network of users 128 without limitation. One skilled in the art can appreciate that data communication links 120 may be implemented over any suitable protocol.
In an initial operating state, node 208 may be running application packages (hereinafter “package”) 240-244. A package may be a service such as email for example. Packages may also represent one or more applications being run in conjunction with a provided service. If, in one example, node 208 should fail as indicated by the dotted “X,” package 240 may be configured to migrate to node 204 while package 244 may be configured to migrate to node 212. Migration of packages 240 and 244 to nodes 204 and 212 respectively demonstrates a method by which clusters operate to provide highly available services. And while the illustrated cluster has only three nodes, more nodes may be configured in a cluster. Further, while only two packages are illustrated, many more packages may be configured and used in a cluster. In the illustrated example, a simple failover algorithm may be employed to accomplish migration. For example, a simple algorithm may take the form:
If node 2 fails, then package 1 migrates to node 1 and package 2 migrates to node 3 (1)
The above illustrative algorithm demonstrates an example relationship between clusters, nodes, and packages. Relationships may be much more complex and may include package dependency. Briefly, package dependency describes a set of conditions which must be fulfilled in order for a given package to operate properly. For example, a package dependency for a given package A might describe a configuration requiring that when another package (package B) is running, package A must wait until package B has ended. Package dependencies may be hardware, software, or environmentally dependent without limitation.
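The simple failover rule of algorithm (1) can be sketched in code. The following is a minimal illustration only, assuming a hypothetical lookup table of failover targets and a hypothetical function name `fail_node`; neither is part of the described system.

```python
# Hypothetical failover table mirroring algorithm (1); names are illustrative.
FAILOVER_TARGETS = {
    "package 1": "node 1",
    "package 2": "node 3",
}


def fail_node(placement, failed_node, failover_targets):
    """Return a new placement after `failed_node` fails.

    `placement` maps a package name to the node currently hosting it.
    Packages on the failed node move to their configured target; a package
    with no configured target is mapped to None (assumed to mean "halted").
    """
    new_placement = {}
    for package, node in placement.items():
        if node == failed_node:
            new_placement[package] = failover_targets.get(package)
        else:
            new_placement[package] = node
    return new_placement


# Both packages start on node 2, which then fails.
initial = {"package 1": "node 2", "package 2": "node 2"}
after_failure = fail_node(initial, "node 2", FAILOVER_TARGETS)
```

The function returns a new mapping rather than modifying its input, so the pre-failure state remains available for comparison.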
Nodes 312 and 314, as illustrated, include packages 320-330, and 332-344 respectively. Further, package 338, as illustrated, is disabled in an auto-run mode. Thus, a graphical icon (e.g. “x”) may be used to illustrate a particular condition of a package. Packages may be generally described as an application or service. Packages may further be independent or dependent. Independent packages may run on a node without requiring or conflicting with any other packages. Dependent packages have some configured package dependency which may relate to other packages, nodes, cluster resources, or clusters. The order in which packages are illustrated herein is not inherently limiting. Any desired order may be illustrated without departing from the present invention.
Also illustrated are halted packages 350-356. Halted packages are packages which, for whatever reason, are no longer running in the cluster. Halted packages may result, for example, from a software failure, a hardware failure, a combination of hardware or software failures, a time-out, a user selection, and others without limitation. Thus, the GUI as illustrated in
Command line text may also return a status of a clustered computer system. It may be appreciated that command line text may be implemented in any suitable convention that is well known in the art. The command line text illustrated below is for illustrative purposes only and should not be construed as limiting in any way. Thus, in one example, a command call of the type:
bmw:/>cmviewcl (2)
may return a table of information as shown below:
The above Table 1 corresponds to
Referring to
Package E 404 may also include a dependency component. One dependency component is illustrated by connection 432. Connection 432 is an example of a mutual exclusion dependency with respect to package B 416 to indicate that package E 404 cannot run concurrently with package B 416. Mutual exclusion dependency may be configured in any number of different manners. In one embodiment, package E 404 may be configured not to run simultaneously on the same node as package B 416. In other embodiments, package E 404 may be configured to not run simultaneously in the same cluster as package B 416.
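The two mutual exclusion variants described above (same node and same cluster) can be expressed as a small check. This is a sketch under stated assumptions: the function name `violates_exclusion`, the scope labels, and the convention that a package mapped to None is not running are all hypothetical.

```python
def violates_exclusion(placement, pkg_a, pkg_b, scope):
    """Report whether two mutually exclusive packages conflict.

    `placement` maps a package name to its node, or to None when the package
    is not running (an assumed convention). `scope` is "same_node" when the
    packages may not share a node, or "all_node" when they may not run
    anywhere in the same cluster at the same time.
    """
    node_a = placement.get(pkg_a)
    node_b = placement.get(pkg_b)
    if node_a is None or node_b is None:
        return False  # at least one of the two packages is not running
    if scope == "same_node":
        return node_a == node_b
    if scope == "all_node":
        return True  # both are running somewhere in the cluster
    raise ValueError(f"unknown exclusion scope: {scope}")


# Package E and package B run on different nodes of the same cluster.
placement = {"E": "node 1", "B": "node 2"}
same_node_conflict = violates_exclusion(placement, "E", "B", "same_node")
all_node_conflict = violates_exclusion(placement, "E", "B", "all_node")
```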
Other dependency components may be configured as well. Connections 424 and 428 illustrate example same node dependencies. A same node dependency relationship describes a configuration where a given package requires another package to be running on a same node in order for the given package to run. As can be appreciated, dependencies may be temporally restricted. For example, as shown, package A 412 depends on package B 416 which in turn depends on package C 420. That is, package C 420 must be up and running before package B 416 may be run. In turn, package B 416 must be up and running before package A 412 may be run. Package dependencies may be necessary where a single package is insufficient to provide a desired service. For example, a finance program may require several database programs in order to provide a full suite of functionality. Thus, the finance program may be configured to depend on those database programs such that the database programs must be up and running before the finance program is started. Other example dependencies include, but are not limited to: an all node exclusion dependency, a same node up dependency, an any node up dependency, and a different node up dependency. These and other embodiments are contemplated in the present invention.
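The start ordering implied by the chain of dependencies above (package C before package B before package A) is a topological ordering of the dependency graph. The sketch below uses Python's standard `graphlib` module purely as an illustration; the source does not state how ordering is computed, and the mapping `depends_on` is a hypothetical representation.

```python
from graphlib import TopologicalSorter

# Each package maps to the set of packages that must already be running.
# Package A depends on package B, which in turn depends on package C.
depends_on = {
    "A": {"B"},
    "B": {"C"},
}

# static_order() yields each package only after all of its prerequisites.
start_order = list(TopologicalSorter(depends_on).static_order())
```

For this chain the only valid ordering is C, then B, then A. A circular dependency would cause `static_order()` to raise `graphlib.CycleError`, which is one way a misconfiguration could be surfaced.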
Still another condition component is a priority. In general, package priority corresponds to a user designated assignment of programmatic importance. Priority describes ascendancy with respect to packages. For example, a user may configure a set of packages on a cluster to provide desired services that might include: a database package, a mail server package, and a query package. In an ideal setting, all packages would be up and running thus providing all desired services. However, when a node failure occurs, for example, then some or all of the service providing packages may not be able to run on remaining nodes. In those instances, it may be useful to assign a priority to each package so that a system may preserve the most critical services. In this example, a high priority may be assigned to the database package while a low priority may be assigned to the query package. Thus, in the event of a node failure, the system will attempt to keep the database package running over the query package. Package priority is discussed in further detail in related application entitled, “SYSTEMS AND METHODS FOR PLACING AND DRAGGING PROGRAMMATIC PACKAGES IN CLUSTERED COMPUTING SYSTEMS,” which is incorporated herein by reference.
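The effect of priority after a node failure can be illustrated with a short sketch. It assumes, hypothetically, that priority is a number where larger means more important, that the remaining nodes can host only a fixed count of packages, and that the mail server package has a middle priority; none of these details is specified above.

```python
def select_packages_to_keep(priorities, capacity):
    """Split packages into those kept running and those halted.

    `priorities` maps a package name to a numeric priority (assumed: a larger
    number is more important). `capacity` is the assumed number of packages
    the remaining nodes can host.
    """
    ranked = sorted(priorities, key=lambda name: priorities[name], reverse=True)
    return ranked[:capacity], ranked[capacity:]


priorities = {"database": 3, "mail server": 2, "query": 1}

# After a node failure, assume only two packages can still be hosted.
kept, halted = select_packages_to_keep(priorities, capacity=2)
```

Under these assumptions the database package survives and the query package is the first to be halted, matching the example above.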
As can be appreciated, a dependency graph as illustrated in
Input component 504 also includes an event input set. An event input set includes, for example, any number of actual, expected, or hypothetical events which will be applied to a configuration defined by a start data set. In one example, a node failure may define an event input set. In another example, a package failure may define an event input set. In still other examples, a test configuration may define an event input set. As can be appreciated, any number of examples may be utilized to define an event input set.
Process component 508 includes a placement engine and a forecast algorithm. Generally, placement is a process by which a package is assigned to a node. Placement on an assigned node takes into account location (i.e. node) and conditions (i.e. dependency and priority) for a given programmatic package so that user preferences may be preserved. Placement is discussed in further detail in related application entitled, “SYSTEMS AND METHODS FOR PLACING AND DRAGGING PROGRAMMATIC PACKAGES IN CLUSTERED COMPUTING SYSTEMS,” which is incorporated herein by reference.
A forecast algorithm may be used to generate an operational status based on a start data set and an event input set. Forecast algorithms will be discussed in further detail below for
Referring to
At a step 608, a current status model is created using a placement engine. As noted above, placement is a process by which a package is assigned to a node or in this case, modeled to a node. A current status model is a representation of the start data set received in step 604. As noted above, a current status model may be either represented textually as in Table 1 above or represented graphically as shown in
If the method determines more events are pending, the method returns to a step 612 and continues until no more events are pending. In this manner, a number of events may be applied to a current status model. As can be appreciated, event order is related to temporality since each event is taken in turn. Further, iterative steps 612-616 may be conceptually represented by the following equations:
Result(1) = f(a)
Result(2) = f(f(a))
Result(3) = f(f(f(a)))
Where (a) is the start data and f( ) is the function that represents a step 616. (3)
In this embodiment, results from the application of an event become start data for a subsequent event until all events have been applied to a given model. An iterative model, as described above, may allow a user to account for temporally sensitive issues. For example, a package having failover properties that may optionally direct the package to more than one node may respond differently depending on which of the nodes fails first. Because relationships and rules may be highly interactive and interdependent, accounting for temporal issues may be difficult or impossible for a user to accomplish manually. Once all events have been processed, forecast status model data may be output at a step 624. The method then ends.
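The iteration described above, in which the result of each event becomes the start data for the next, is a left fold over the event list. The following sketch is illustrative only: the status model is reduced to a set of running nodes, and `apply_event`, `forecast`, and the event tuples are hypothetical stand-ins for step 616 and the event input set, not the actual forecast algorithm.

```python
from functools import reduce


def apply_event(up_nodes, event):
    """Hypothetical stand-in for step 616: apply one event to a status model.

    The status model is simplified to a frozenset of nodes that are up.
    """
    kind, node = event
    if kind == "node_failure":
        return up_nodes - {node}
    if kind == "node_addition":
        return up_nodes | {node}
    raise ValueError(f"unknown event kind: {kind}")


def forecast(start, events):
    """Apply events in order; each result is the input for the next event."""
    return reduce(apply_event, events, start)


start = frozenset({"node 1", "node 2", "node 3"})
fail_then_add = [("node_failure", "node 2"), ("node_addition", "node 2")]
add_then_fail = list(reversed(fail_then_add))

result_a = forecast(start, fail_then_add)  # node 2 fails, then is restored
result_b = forecast(start, add_then_fail)  # node 2 is added, then fails
```

The two runs apply the same events in a different order and end in different states, which is the temporal sensitivity the iterative model is meant to capture.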
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, modifications, and various substitute equivalents as fall within the true spirit and scope of the present invention.
The present invention is related to the following application, which is incorporated herein by reference: Commonly assigned application entitled “SYSTEMS AND METHODS FOR PLACING AND DRAGGING PROGRAMMATIC PACKAGES IN CLUSTERED COMPUTING SYSTEMS,” filed on even date herewith by the same inventors herein (Attorney Docket Number: 200407298-1).