Method for providing a fault tolerant network using distributed server processes to remap clustered network resources to other servers during server failure

Abstract
The method of the current invention provides fault tolerant access to a network resource. A replicated network directory database operates in conjunction with server resident processes to remap a network resource in the event of a server failure. The records/objects in the replicated database contain, for each network resource, a primary and a secondary server affiliation. Initially, all users access a network resource through the server identified in the replicated database as the primary server for the network resource. When server resident processes detect a failure of the primary server, the replicated database is updated to reflect the failure of the primary server and to change the affiliation of the network resource from its primary to its backup server. This remapping occurs transparently to whichever user/client is accessing the network resource. As a result of the remapping, all users access the network resource through the server identified in the replicated database as the backup server for the resource. When the server resident processes detect a return to service of the primary server, the replicated database is again updated to reflect the resumed operation of the primary server. This remapping of network resource affiliations also occurs transparently to whichever user/client is accessing the network resource, and returns the resource to its original fault tolerant state.
Description




APPENDICES




Appendix A, which forms a part of this disclosure, is a list of commonly owned copending U.S. patent applications. Each one of the applications listed in Appendix A is hereby incorporated herein in its entirety by reference thereto.




Appendix B, which forms part of this disclosure, is a copy of the U.S. provisional patent application filed May 13, 1997, entitled “Clustering Of Computer Systems Using Uniform Object Naming And Distributed Software For Locating Objects,” and assigned application Ser. No. 60/046,327.




COPYRIGHT AUTHORIZATION




A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.




BACKGROUND OF THE INVENTION




1. Field of Invention




The present invention pertains to computer networks. More particularly, this invention relates to improving the ability of a network to route around faulty components.




2. Description of Related Art




As computer systems and networks become more complex, various systems for promoting fault tolerance have been devised. To prevent network down-time due to power failure, uninterruptible power supplies (UPS) have been developed. A UPS is basically a rechargeable battery to which a workstation or server is connected. In the event of a power failure, the workstation or server is maintained in operation by the rechargeable battery until such time as the power resumes.




To prevent network down-time due to failure of a storage device, data mirroring was developed. Data mirroring provides for the storage of data on separate physical devices operating in parallel with respect to a file server. Duplicate data is stored on separate drives. Thus, when a single drive fails the data on the mirrored drive may still be accessed.




To prevent network down-time due to failure of a print/file server, server mirroring has been developed. Server mirroring as it is currently implemented requires a primary server and storage device, a backup server and storage device, and a unified operating system linking the two. An example of a mirrored server product is the Software Fault Tolerance level 3 (SFT III) product by Novell Inc., 1555 North Technology Way, Orem, Utah, offered as an add-on to its NetWare® 4.x product. SFT III maintains servers in an identical state of data update. It separates hardware-related operating system (OS) functions on the mirrored servers so that a fault on one hardware platform does not affect the other. The server OS is designed to work in tandem across two servers. One server is designated as a primary server, and the other is a secondary server. The primary server is the main point of update; the secondary server is in a constant state of readiness to take over. Both servers receive all updates through a special link called a mirrored server link (MSL), which is dedicated to this purpose. The servers also communicate over the local area network (LAN) that they share in common, so that one knows if the other has failed even if the MSL has failed. When a failure occurs, the second server automatically takes over without interrupting communications in any user-detectable way. Each server monitors the other server's NetWare Core Protocol (NCP) acknowledgments over the LAN to see that all the requests are serviced and that the OSs are constantly maintained in a mirrored state.




When the primary server fails, the secondary server detects the failure and immediately takes over as the primary server. The failure is detected in one or both of two ways: the MSL link generates an error condition when no activity is noticed, or the servers communicate over the LAN, each one monitoring the other's NCP acknowledgment. The primary server is simply the first server of the pair that is brought up. It then becomes the server used at all times and it processes all requests. When the primary server fails, the secondary server is immediately substituted as the primary server with identical configurations. The switch-over is handled entirely at the server end, and work continues without any perceivable interruption.




Power supply backup, data mirroring, and server mirroring all increase security against down time caused by a failed hardware component, but they all do so at considerable cost. Each of these schemes requires the additional expense and complexity of standby hardware that is not used unless there is a failure in the network. Mirroring, while providing redundancy to allow recovery from failure, does not allow the redundant hardware to be used to improve the cost/performance of the network.




What is needed is a fault tolerant system for computer networks that can provide all the functionality of UPS, disk mirroring, or server mirroring without the added cost and complexity of standby/additional hardware. What is needed is a fault tolerant system for computer networks which smoothly interfaces with existing network systems.




SUMMARY OF THE INVENTION




In an embodiment of the invention, the method comprises the acts of:




providing a network resource database, where the database includes individual records corresponding to clustered network resources and, for a clustered network resource, includes a first record corresponding to that network resource, the first record identifying the primary server for the network resource as the first server;




selecting, on the basis of the first record, the first server to provide service to the client workstation with respect to that clustered network resource;




recognizing the backup server for that clustered network resource as the second server;




detecting a failure of the first server; and




routing communications between the client workstation and the network resource via the second server, responsive to the recognizing and detecting acts.




In another embodiment of the invention, the method comprises the additional acts of:




identifying in the first record, the primary server for the network resource as the first server;




discovering a recovery of the first server; and




re-routing communications between the client workstation and the network resource via the first server, responsive to the identifying and discovering acts.











DESCRIPTION OF FIGURES





FIG. 1 is a hardware block diagram of a prior art network with a replicated database for tracking network resources.

FIG. 2 is a functional block diagram of the replicated database used in the prior art network shown in FIG. 1.

FIG. 3 is a hardware block diagram showing a network with server resident processes for providing fault tolerant network resource recovery in accordance with the current invention.

FIG. 4 is a detailed block diagram of an enhanced network resource object definition which operates in conjunction with the server resident processes of FIG. 3.

FIGS. 5A-E are hardware block diagrams showing the detection, fail-over and fail-back stages of the current invention for a storage device connected to a primary and a backup server.

FIGS. 6A-E show the object record for the storage device of FIG. 5 on both the primary and secondary server during the stages of detection, fail-over and fail-back.

FIG. 7 is a functional block diagram showing the processing modules of the current invention on a server.

FIGS. 8A-C are process flow diagrams showing the authentication, detection, fail-over, recovery detection, and fail-back processes of the current invention.











DETAILED DESCRIPTION




The method of the current invention provides a fault tolerant network without hardware mirroring. The invention involves an enhanced replicated network directory database which operates in conjunction with server resident processes to remap network resources in the event of a server failure. In some embodiments, the enhanced network directory database is replicated throughout all servers in the cluster. The records/objects in the enhanced database contain, for at least one clustered resource, a primary and a secondary server affiliation. Initially, all users access a clustered resource through the server identified in the enhanced database as the primary server for that clustered resource. When server resident processes detect a failure of the primary server, the enhanced database is updated to reflect the failure of the primary server and to change the affiliation of the resource from its primary to its backup server. The updating and remapping are accomplished by server resident processes which detect failure of the primary server and remap the network resource server affiliation. This remapping occurs transparently to whichever user/client is accessing the resource. Thus, network communications are not interrupted and all users access a resource through its backup server while its primary server is out of operation. This process may be reversed when the primary server resumes operation, thereby regaining fault tolerant, i.e., backup, capability.




No dedicated redundant resources are required to implement the current invention. Rather, the current invention allows server resident processes to intelligently reallocate servers to network resources in the event of server failure.





FIG. 1 is a hardware block diagram of a prior art enterprise network comprising LAN segments 50 and 52. LAN segment 50 comprises workstations 70-72, server 58, storage device 82, printer 86 and router 76. Server 58 includes display 64. LAN segment 52 includes workstations 66 and 68, servers 54 and 56, storage devices 78 and 80, printer 84 and router 74. Server 56 includes display 62. Server 54 includes display 60. All servers in the cluster, and substantially all operating components of each clustered server, are available to improve the cost/performance of the network.

LAN segment 50 includes a LAN connection labeled “LAN-2” between workstations 70 and 72, server 58 and router 76. Storage device 82 and printer 86 are connected to server 58 and are available locally as network resources to either of workstations 70 and 72. LAN segment 50 is connected by means of router 76 to LAN segment 52 via router 74. LAN segment 52 includes a LAN connection labeled LAN-1 between workstations 66 and 68, servers 54-56 and router 74. Storage device 78 and printer 84 are connected to server 54. Storage device 80 is connected to server 56. Either of workstations 66 and 68 can connect via server 54 locally to printer 84 and storage device 78. Either of workstations 66 and 68 can connect locally via server 56 to storage device 80.




Each of the servers 54, 56 and 58 includes, respectively, copies 88A, 88B and 88C of the replicated network directory database. Such a replicated network directory database is part of the NetWare Directory Services (NDS) provided in Novell's NetWare® 4.x product. The format and functioning of this database, which is a foundation for the current invention, is described in greater detail in FIG. 2.





FIG. 2 is a detailed block diagram of replica 88A of the prior art replicated NetWare® Directory Services (NDS) database, such as is part of NetWare® 4.x. The replicated network directory database includes a directory tree 102 and a series of node and leaf objects within the tree. Leaf object 104 is referenced. In NDS, physical devices are represented by objects, or logical representations of physical devices. Users are logical user accounts and are one type of NDS object. The purpose of object oriented design is to mask the complexity of the physical configuration of the network from users and administrators. A good example of a logical device is a file server. A file server is actually a logical device, such as a NetWare® server operating system running on a physical device, a computer. A file system is the logical file system represented in the file server's memory and then saved on the physical hard drive. In NDS, a file server is a type of object and so is a NetWare® volume. Objects represent physical resources, users and user related entities such as groups. Object properties are different types of information associated with an object. Property values are simply names and descriptions associated with the object properties. For example, HP3 might be the property value for the printer name object property, which is in turn associated with the printer object.




NDS uses a hierarchical tree structure to organize the various objects. Hence, the structure is referred to as the NDS tree. The tree is made up of three types of objects: the root object, container objects and leaf objects. The location in which objects are placed in a tree is called a context or name context, similar to a pointer inside a database. The context is of key importance; to access a resource, the user object must be in the same context as the resource object. A user object has access to all objects that lie in the same directory and in child directories. The root object is the top of a given directory tree. Branches are made of container objects. Within them are leaf objects. A crude analogy is the directory tree of your hard disk. There is a back slash or root at the base of the tree, each subdirectory can be compared to a container object, and files within the subdirectories can be compared to leaf objects in NDS. The root object is created automatically when you first install NDS. It cannot be renamed or deleted and there is only one root in a given NDS tree. Container objects provide a way to logically organize other objects in the NDS tree. A container object can house other container objects within it. The top container is called the parent object. Objects contained in a container object are said to be child objects. There are three types of parents or containers: organization, organizational unit and country. You must have at least one organization object within the NDS tree and it must be placed one level below root. The organization object is usually used to denote a company. Organizational units are optional. If you use them, they are placed one level below an organization object. Leaf objects include user, group, server, volume, printer, print queue and print server. Associated with each object is a set of object rights.




The directory tree is a distributed database that must make information available to users connected to various parts of the network. Novell Directory Services introduces two new terms, partition and replica, to describe the mechanics of how the directory tree is stored. The directory tree is divided into partitions. A partition is a section of the database under an organizational unit container object, such as Marketing. Novell Directory Services divides the network directory tree into partitions for several reasons. A replica is a copy of a directory tree partition. Having replicas of directory tree partitions on multiple servers has both good and bad points. Each replica must contain information that is synchronized with every corresponding replica. A change to one partition replica must be echoed to every other replica. The more replicas, the more network traffic is involved in keeping the replicas synchronized.
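By way of illustration only, the following minimal sketch models this echo of a property change to every replica of a partition. The Replica class and update_property helper are hypothetical constructs introduced for this description; they are not the NDS implementation or its programming interface.

    # Minimal sketch of replica synchronization (hypothetical classes, not the NDS API).
    class Replica:
        def __init__(self, server_name):
            self.server_name = server_name
            self.objects = {}                  # resource name -> {property: value}

        def apply(self, resource, prop, value):
            self.objects.setdefault(resource, {})[prop] = value

    def update_property(replicas, resource, prop, value):
        """Write a property change once, then echo it to every replica of the partition."""
        for replica in replicas:               # every replica must converge on the same value
            replica.apply(resource, prop, value)

    # Example: three servers each hold a replica of the Marketing partition.
    replicas = [Replica("server-54"), Replica("server-56"), Replica("server-58")]
    update_property(replicas, "RAID-80", "host server", "server-56")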




The directory tree 102 in FIG. 2 comprises a root node, a company node, organizational unit nodes, and leaf nodes. The company node is directly beneath the root node. Beneath the company node are two organizational unit nodes labeled Accounting and Marketing. Beneath each organizational unit node are a series of leaf nodes representing those network resources associated with the Accounting unit and the Marketing unit. The organizational unit labeled Accounting has associated with it the following leaf nodes: server 58, storage device 82, printer 86, and users A and B [see FIG. 1]. The organizational unit labeled Marketing has associated with it the following leaf nodes: server 56, server 54, storage device 80, storage device 78, printer 84, and users C and D [see FIG. 1].




Each object has specific object properties and property values. As defined by Novell in their NetWare® 4.x, a volume object has three object properties: context, resource name, and server affiliation. Context refers to the location of the object in the directory tree and is similar in concept to a path statement. For example, printer 86 is a resource available to the Accounting Department, and not to the Marketing Department. The next object property is resource name. The resource name is a unique enterprise wide identifier for the resource/object. The next object property is host server affiliation. Host server affiliation is an identifier of the server through which the object/resource may be accessed.

The object/resource record 104 for storage device 80 is shown in FIG. 2. The object/resource includes a context property 106, a resource name property 108, and a host server affiliation property 110. The context property has a context property value 106A of: [Root]\Company\Marketing\. The resource name property value 108A is: RAID-80. The host server affiliation property value 110A is server 56. A network administrator may add or delete objects from the tree. The network administrator may alter object property values. As discussed above, any changes made in the directory of one server are propagated/replicated across all servers in the enterprise.




Enhanced Directory+Server Processes





FIG. 3 is a hardware block diagram of an embodiment of the current invention in a local area network. Users A-D are shown interfacing via workstations 66-72 with network resources. The network resources include servers 54-58, storage devices 78-82 and printers 84-86, for example. The relationship between network resources is defined not only as discussed above in connection with FIGS. 1-2 for normal operation, but also for operation in the event of a failure of any one of the network resources. This additional utility, i.e., fault tolerance, is a result of enhancements to the network directory databases 150A-C and processes 152A-C resident on each server. The server resident processes operating in conjunction with the enhanced network directory database allow failure detection, resource/object remapping and recovery. Thus, network downtime is reduced by transparently remapping network resources in response to a detection of a failure. The remapped route is defined within the enhanced directory. The routes that are defined in the directory may be part of the initial administrative setup, may be a result of an automatic detection process, or may be a result of real time arbitration between servers. The server resident processes 152A-C have the additional capability of returning the resource/network to its initial configuration when the failed resource has been returned to operation. This latter capability is also a result of the interaction between the host resident processes 152A-C and the enhanced network directory 150A-C.





FIG. 4 is a detailed block diagram of the enhanced object/resource properties provided within the enhanced network directory database. Object 200A within enhanced network directory database 150A is shown. Object 200A contains object properties and property values for the storage device RAID 80 shown in FIG. 3. In addition to the three prior art object properties, i.e., context property 106, resource name property 108 and host server affiliation property 110, additional properties are defined for the object in the enhanced network directory database. Property 210 is the primary server affiliation for the resource, i.e., RAID-80. The property value 210A for the primary server affiliation is server 56. Thus, server 56 will normally handle network communications directed to RAID-80. Property 212 is the backup server affiliation for the resource. The property value 212A for the backup server affiliation is server 54. Thus server 54 will handle network communications with RAID-80 in the event of a failure of server 56. Cluster property 214 indicates whether the resource is cluster capable, i.e., can be backed up. The values for this property are boolean True or False. The cluster property value in FIG. 4 is boolean True, which indicates that RAID 80 has physical connections to more than one server, i.e., servers 54-56. Enable property 216 indicates whether a cluster capable object will be cluster enabled, i.e., will be given a backup affiliation. The property values associated with this property are boolean True or False. The enable property value 216A of boolean True indicates that RAID 80 is cluster capable and that clustering/backup capability has been enabled. The optional auto-recover property 218 indicates whether the cluster enabled object is to be subject to automatic fail-back and/or auto recovery. The auto-recover property has the property values of boolean True or False. The auto-recover property value 218A is True, which indicates that RAID 80 will fail back without user confirmation to server 56 when server 56 recovers. Prior state property 220 indicates the prior state of the resource. The property values associated with this property are: OK, fail-over in progress, fail-over complete, fail-back in progress, and fail-back complete. The prior state property value 220A of OK indicates that this resource, RAID 80, has not failed. The priority property 222 indicates the priority for fail-over. The priority property may have values of 1, 2 or 3. This property may be utilized to stage the fail-over of multiple resources where the sequencing of recovery is critical. The priority property value 222A of “2” indicates that RAID 80 will have an intermediate staging in a recovery sequence. The hardware ID property 224 is the unique serial identifier associated with each hardware resource. The hardware ID property value 224A of :02:03:04:05 indicates that RAID 80 is comprised of four volumes, each with its own unique serial number identifier. Any object in the enhanced network directory database may be clustered/backed up. Therefore the methods of certain embodiments of the current invention are equally applicable to printers, print queues, and databases as well as to storage devices.
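As an illustration of the enhanced object definition just described, the following minimal sketch models the properties of FIG. 4 as a simple record. The field names are paraphrases of the properties discussed above, chosen for readability; they are not the literal schema identifiers used by the directory.

    from dataclasses import dataclass, field

    # Sketch of the enhanced resource object of FIG. 4 (illustrative field names only).
    @dataclass
    class ClusteredResourceObject:
        context: str                      # location in the directory tree
        resource_name: str                # enterprise-wide unique name
        host_server: str                  # server currently providing access
        primary_server: str               # normal (primary) server affiliation
        backup_server: str                # server used if the primary fails
        cluster_capable: bool = False     # physically connected to more than one server
        cluster_enabled: bool = False     # a backup affiliation has been assigned
        auto_recover: bool = False        # fail back without user confirmation
        prior_state: str = "OK"           # OK, fail-over/fail-back in progress or complete
        priority: int = 2                 # 1, 2 or 3; stages the order of fail-over
        hardware_ids: list = field(default_factory=list)  # serial IDs of attached volumes

    # Example values corresponding to object 200A for storage device RAID 80.
    raid_80 = ClusteredResourceObject(
        context="[Root]\\Company\\Marketing\\",
        resource_name="RAID-80",
        host_server="server-56",
        primary_server="server-56",
        backup_server="server-54",
        cluster_capable=True,
        cluster_enabled=True,
        auto_recover=True,
        hardware_ids=[":02", ":03", ":04", ":05"],   # paraphrases the :02:03:04:05 value
    )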





FIGS. 5A-E show a sequence of detection, fail-over and fail-back for storage device 80. Storage device 80 may be affiliated physically with either one of two servers. FIG. 5A includes workstations 66-68, servers 54-56, storage devices 78-80, printer 84 and router 74. Server 54 includes a display 60. Server 56 includes a display 62. Workstations 66 and 68 are connected via LAN-1 to router 74 and servers 54 and 56. Server 56 is directly connected to storage device 80. Server 54 is directly connected to printer 84 and storage device 78. A connection 250 also exists between servers 54-56 and storage devices 78-80. Servers 54-56 contain replicas 150B and 150A, respectively, of the enhanced network directory database. Server 54 runs process 152B for detection, fail-over and fail-back. Server 56 runs process 152A for detection, fail-over and fail-back. In FIG. 5A, server 56 has a primary relationship with respect to storage device 80. This relationship is determined by the object properties for storage device 80 in the replicated network directory database [see FIG. 4]. Communication 252 flows between RAID 80 and server 56. In the example shown, workstation 68 is communicating via server 56 with storage device 80. Server 54 has a primary relationship with storage device 78 as indicated by communication marker 254. This relationship is determined by the object properties for storage device 78 in the replicated network directory database.





FIG. 5B shows an instance in which server 56 and the process resident thereon have failed, as indicated by the termination marks 256 and 258. Communications via server 56 between workstation 68 and storage device 80 are terminated.




As shown in FIG. 5C, process 152B running on server 54 has detected the failure of server 56 and has remapped communications between workstation 68 and storage device 80 via server 54. This remapping is the result of the process 152B running on server 54. These processes have detected the failure of server 56. They have determined, on the basis of backup property values for storage device 80 stored in the enhanced network directory database 150B, that server 54 can provide backup capability for storage device 80. Finally, they have altered the property values on the object/record for storage device 80 within the enhanced network directory database to cause communications with the storage device 80 to be re-routed through server 54.





FIG. 5D indicates that server 56 has resumed normal operation. Server 56's replica 262 of the enhanced network directory database is stale, or out of synchronization with the current replica 150B contained in server 54. The replica 262 is updated to reflect the changes in the server-to-storage device configuration brought about by the processes running on server 54. Communications between workstation 68 and storage device 80 continue to be routed via server 54, as indicated by communication marker 260, because that server is listed on all replicas of the enhanced network directory database as the host server for the resource/object identified as storage device 80.




In FIG. 5E, updated replicas 150A-B of the network directory database are present on servers 56 and 54, respectively. Processes 152A-B running on servers 56 and 54, respectively, have caused the network to be reconfigured to its original architecture in which server 56 is the means by which workstation 68, for example, communicates with storage device 80. This fail-back is a result of several acts performed cooperatively by processes 152A-B. Process 152B detects re-enablement of server 56. Process 152B relinquishes ownership of storage device 80 by server 54. Process 152A running on server 56 detects relinquishment of ownership of storage device 80 by server 54 and in response thereto updates the host server property value for the resource/object storage device 80 in the replicated network directory database. Communications between workstation 68 and storage device 80 are re-established via server 56, as indicated by communication marker 252. All of these processes may take place transparently to the user, so as not to interrupt network service. During the period of fail-over, server 54 handles communications to both storage device 78 and storage device 80.





FIGS. 6A-6E show the object properties and property values for storage device 80 during the events shown at a hardware level in FIGS. 5A-E. These object properties and property values are contained in replicas 150A-B of the enhanced network directory database, stored respectively in servers 56 and 54. In FIGS. 6A-E, the object property values for storage device 80 are shown. On the left-hand side of each figure the object/record that is stored in server 56 is shown. On the right-hand side of each figure the object/record that is stored in server 54 is shown. The enhanced directory replicated on each server contains multiple objects representing all network resources. The enhanced directory on server 56 is shown to contain an additional object 202A and the enhanced directory on server 54 is shown to contain an additional object 202B, representative of the many objects within each enhanced directory database.





FIG. 6A

shows an initial condition in which the host server property variable


110


A/B and the primary server property value


210


A/B match. In

FIG. 6B

, server


54


has failed and therefore the replica of the enhanced network directory database and each of the objects within that directory are no longer available as indicated by failure mark


258


. Nevertheless, an up-to-date current replica of the enhanced network directory database is still available on server


56


as indicated by objects


200


B and


202


B on the right-hand side of FIG.


6


B. In

FIG. 6C

, the fail-over corresponding to that discussed above in connection with

FIG. 5C

is shown. The host server property value


110


B has been updated to reflect current network routing during the failure of server


56


. The host server property value


110


B is server


54


. Because the resource/object for storage device


80


appears on all servers as server


54


, all communications between workstations, i.e., workstation


68


are re-routed through server


54


to storage device


80


. The fail-over is accomplished by resident process


152


B on server


54


[see FIGS.


5


A-E]. These processes detect the failure of server


56


. Then they determine which server is listed in the resource/object record for storage device


80


as a backup. Next the processes write the backup property value to the host property value for storage device


80


. Replicas of this updated set of property value(s) for object


200


B, corresponding to the storage device


80


, are then replicated throughout the network. As indicated in

FIG. 6C

, the prior state property value


220


B is updated by the resident process


152


B [see FIGS.


5


A-E] to indicate that a fail-over has taken place.
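The fail-over steps just described (detect the failure, look up the backup affiliation, write the backup value into the host server property, and replicate) can be summarized in the following minimal sketch. The record is the dictionary-of-properties form used in the earlier replica sketch, and propagate is a hypothetical callable standing in for the directory replication that echoes a property change to every server; this is an illustration of the sequence, not the literal implementation of process 152B.

    # Sketch of the fail-over remapping performed by the backup server's resident process.
    def fail_over(record, failed_server, propagate):
        if record["host server"] != failed_server:
            return False                      # the resource was not hosted by the failed server
        # Re-point the resource at its backup server and note the state change,
        # then echo both new values to every replica in the enterprise.
        record["host server"] = record["backup server"]
        record["prior state"] = "fail-over complete"
        propagate(record["resource name"], "host server", record["host server"])
        propagate(record["resource name"], "prior state", record["prior state"])
        return True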




As indicated in FIG. 6D, when server 56 first comes back on line it contains a stale, i.e., out of date, copy of the property values for all objects, including object 200A corresponding to storage device 80. The existing functionality of NetWare® 4.x causes this stale enhanced directory to be refreshed with the current property values generated, in this instance, by the resident process 152B on server 54 [see FIGS. 5A-E].





FIG. 6E indicates the completion of a fail-back. The resident process 152B [see FIGS. 5A-E] on server 54 detects resumption of operation by server 56. These processes then relinquish ownership of storage device 80. Then the resident process 152A [see FIGS. 5A-E] on server 56 asserts ownership of storage device 80 and rewrites the host server property value 110B for storage device 80 to correspond to the primary server property value 210A, which is server 56. This updated record/object is again replicated in all the replicated network directory databases throughout all the servers on the network. Thus, all communications between workstations, e.g., workstation 68, and storage device 80 are routed through server 56 [see FIGS. 5A-E]. The prior state property values 220A-B are set to fail-back, indicating that the prior state for the object is fail-back.





FIG. 7 is a block diagram of the modules resident on server 56 which collectively accomplish the process 152A associated with detection, fail-over and fail-back. Similar modules exist on each server. A server input unit 304 and display 62 are shown. Modules 306-316 are currently provided with network utilities such as NetWare® 4.x. These modules may interact with modules 320-328 in order to provide the resident process 152A for detection, fail-over and fail-back. Module 306 may be a NetWare Loadable Module (NLM) which provides a graphical interface by which a user can interact with NetWare® 4.x and also with the resident process 152A. Module 308 may be a communication module which provides connection oriented service between servers. A connection oriented service is one that utilizes an acknowledgment packet for each packet sent. Module 310 may include client based applications which allow a user at a workstation to communicate 330 directly with both the network software and the resident process 152A. Module 150A is a replica of the enhanced network directory database which includes the additional object properties discussed above in FIGS. 3-4. Module 312, identified as Vol-Lib, is a loadable module which provides volume management services including scanning for volumes, mounting volumes and dismounting volumes. Module 314 is a media manager module which allows a server to obtain identification numbers for all resources which are directly attached to the server. Module 316 is a peripheral attachment module which allows the server to communicate with devices such as storage devices or printers which may be directly attached to it. Module 320 provides an application programming interface (API) which allows additional properties to be added to each object in the enhanced network directory database. This module also allows the property values for those additional properties to be viewed, altered, or updated.




Modules 322-328 may interact with the above discussed modules to provide the server resident processes for detection, fail-over and fail-back. Module 322 may handle communications with a user through network user terminal module 306. Module 322 may also be responsible for sending and receiving packets through NCP module 308 to manage failure detection and recovery detection of a primary server. Module 324, the directory services manager, may be responsible for communicating through module 320 with the enhanced network directory database 150A. Module 324 controls the addition of properties as well as the viewing and editing of property values within that database. Module 326 is a device driver which, in a current embodiment, superimposes a phase shifted signal on the peripheral communications between a server and its direct connected resources to detect server failure. Module 326 sends and receives these phase shifted signals through module 316. Module 328 controls the overall interaction of modules 322-326. In addition, module 328 interfaces with module 312 to scan, mount and dismount objects/resources. Furthermore, module 328 interacts with module 314 to obtain device hardware identifiers for those devices which are directly attached to the server. The interaction of each of these modules to provide for detection, fail-over and fail-back will be discussed in detail in the following FIGS. 8A-C.





FIGS. 8A-C show an embodiment of the processes for detection, fail-over and fail-back which are resident on all servers. FIG. 8A identifies the initial processes on both primary and backup servers for creating and authenticating objects. The authentication of an object/resource involves determining its cluster capability, and its primary and backup server affiliation. FIG. 8B indicates processes corresponding to the failure detection and fail-over portion of the processes. FIG. 8C corresponds to those processes associated with recovery and fail-back. FIGS. 8A-C each have a left-hand and a right-hand branch, identified as the primary and backup branches. Since each server performs as a primary with respect to one object and a secondary with respect to another object, it is a characteristic of the resident processes that they will run alternately in a primary and a backup mode depending on the particular object being processed. For example, when an object being processed has a primary relationship with respect to the server running the processes, then the processes in FIGS. 8A-C identified as primary will be conducted. Alternately, when an object being processed has a secondary relationship with respect to the server running the processes, then the processes in FIGS. 8A-C identified as backup will be conducted. An object which has neither a primary nor a backup relationship with the server running the process will not be subject to detection, fail-over or fail-back processing.





FIG. 8A sets forth an embodiment of the authentication process. During authentication, specific primary and secondary server relationships are established for a specific network object, and those relationships are written into the property values associated with that object in the enhanced network directory database. The process begins at process 350 in which a new object is created and property values for context, resource name and server affiliation are defined for that object. Control is then passed to process 352 in which the additional properties discussed above in connection with FIG. 4 are added to that object's definition. Then in process 354 default values for a portion of the new expanded properties are added.




There are several means by which default values can be obtained. In an embodiment, default values are completely defined by a network administrator for all expanded properties at the time of object creation. In another embodiment, the one shown in FIG. 8A, only minimal default values are initially defined and the server resident processes on the primary and secondary server define the rest. In either event, the newly created object is added to the expanded network directory and replicated throughout the network.




In FIG. 8A only the capable, enabled, auto-recover and priority properties are defined. Capable, enabled and auto-recover are all set to boolean False and priority is set to a value of “2”. The partial definition of property values is feasible only because the hardware environment corresponds to that shown in FIGS. 5A-E and allows for an auto discovery process by which the property values can be automatically filled in by the primary and secondary server resident processes. These processes will now be explained in greater detail. Control then passes to process 356. In process 356, the data manager module acting through the vol-lib module [see FIG. 7] obtains the newly created object. Control is then passed to process 358. In process 358, the data manager module acting through the media module [see FIG. 7] determines the hardware IDs for all the devices directly connected to the server. Control is then passed to process 360. In process 360 the values for the primary server property 210 and the hardware IDs 224 [see FIG. 4] in the expanded fields for the newly created object are filled in. The value for the primary server property 210 is equal to the ID of the server running the process. The value for the hardware ID property 224 is set equal to those hardware IDs obtained in process 358 which correspond to the hardware IDs of the object. Control is then passed to decision process 362. In process 362 a determination is made as to whether the user desires to create a new object. In the event that determination is in the affirmative, control returns to process 350.




Each server also runs backup authentication processes, which are shown on the right side of FIG. 8A. These processes commence at process 370. In process 370 the data manager, through the media manager [see FIG. 7], scans locally for all devices which are directly attached to the server. This local scan produces hardware IDs for those objects to which the server is directly attached. Control is then passed to process 372. In process 372 the data manager, through Vol-Lib [see FIG. 7], obtains globally, via the enhanced NDS database, a list of all objects with hardware IDs which match those retrieved in process 370. Those objects which have hardware IDs matching those produced in the local scan conducted in process 370 are passed to decision process 376. In decision process 376 a determination is made as to which among those objects have a host server property field with a server ID corresponding to the ID of the server running these backup processes. In the event that determination is in the affirmative, control is passed to process 374, in which the next object in the batch passed from process 372 is selected. Control then returns to decision process 376. In decision process 376, objects in which the host server property ID does not match the ID of the server running the process are passed to process 378. In process 378 the server running the process has identified itself as one which can serve as a backup for the object being processed. Thus, the ID of the server running the process is placed in the backup server field 212B [see FIG. 4]. Additionally, the object's cluster capable and cluster enabled fields 214B-216B are set to a boolean True condition. The autorecover field 218B is set to boolean True as well. The priority, state and previous state fields 222B and 220B are also filled in with default values. Control is then passed to decision process 380. In decision process 380 a determination is made as to whether a user with administrative privileges wishes to change the default values. In the event that determination is in the negative, control returns to process 370. Alternately, if a determination in the affirmative is reached, then control is passed to process 382. In process 382 an administrator may disable the cluster enabled, autorecover and priority fields 216B, 218B and 222B. Control then returns also to process 370.
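The backup authentication loop of processes 370-382 can be illustrated with the following minimal sketch. The scan_local_hardware_ids callable and the dictionary-of-records form of the directory are hypothetical placeholders for the media manager and Vol-Lib calls of FIG. 7; the sketch only shows the decision structure described above.

    # Sketch of the backup branch of FIG. 8A: register this server as backup for
    # any direct-attached object whose host is some other server.
    def authenticate_as_backup(my_server_id, directory, scan_local_hardware_ids):
        local_ids = set(scan_local_hardware_ids())           # process 370: local scan
        for name, obj in directory.items():                  # process 372: global lookup
            if not local_ids.intersection(obj.get("hardware ids", [])):
                continue                                      # not physically attached here
            if obj["host server"] == my_server_id:
                continue                                      # process 376: already the host; skip
            # Process 378: this server can act as a backup for the object.
            obj["backup server"] = my_server_id
            obj["cluster capable"] = True
            obj["cluster enabled"] = True
            obj["auto-recover"] = True
            obj.setdefault("priority", 2)
            obj.setdefault("prior state", "OK")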





FIG. 8B

shows the failure detection and fail-back portions of both the primary and backup processes. The processes for a server performing as a primary with respect to an object commence with splice block A. From splice block A control passes to process


398


. In process


398


a drive pulse protocol is initiated. The drive pulse protocol is appropriate for those objects which are connected to the server by a bus, a Small Computer Storage Interconnect (SCSI) bus with multiple initiators, or any other means of connection. For example, in

FIGS. 5A-5E

, connection


250


connects both servers


54


and


56


to storage device


80


. The drive pulse protocol across connection


250


enables the secondary server to sense primary server failure, as will be discussed shortly in connection with processes


402


-


408


. The drive pulse protocol works by the current host, by some prearranged schedule, continuously issuing SCSI “release” and “reserve” commands, to allow the backup to detect the failure of the primary. The backup detects these commands being issued by the primary by continuously sending a “test unit ready”. Control is then passed to process


400


. Process


400


indicates that the primary continues to perform its portion of the drive pulse protocol until there is a failure of the primary in which case control passes to splice block C.
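The drive pulse protocol lends itself to a short sketch. The scsi_reserve, scsi_release and scsi_test_unit_ready callables below are hypothetical stand-ins for the SCSI commands issued through the peripheral attachment module, the returned status strings are assumptions made for this illustration, and the timing values are illustrative only.

    import time

    # Sketch of the drive pulse protocol on a shared SCSI bus with multiple initiators.
    def primary_drive_pulse(device, scsi_reserve, scsi_release, interval=1.0):
        """Primary side: reserve and release the device on a prearranged schedule."""
        while True:
            scsi_reserve(device)
            time.sleep(interval / 2)
            scsi_release(device)
            time.sleep(interval / 2)

    def backup_senses_primary_alive(device, scsi_test_unit_ready, window=5.0, interval=0.5):
        """Backup side: poll with 'test unit ready'; a reservation conflict shows the
        primary is still pulsing the device. No conflict within the window suggests
        the primary has stopped communicating with the device."""
        deadline = time.time() + window
        while time.time() < deadline:
            status = scsi_test_unit_ready(device)   # assumed to return "RESERVATION_CONFLICT" or "READY"
            if status == "RESERVATION_CONFLICT":
                return True                         # primary still reserves/releases the device
            time.sleep(interval)
        return False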




The processes run on the backup server in connection with failure detection and fail-over are initiated at splice block B, which is shown on the right-hand side of FIG. 8B. Control passes from splice block B to processes 402-404. In process 402 the backup server continually monitors the LAN communication between itself and the primary server to determine when the primary server has failed. It does this by determining the primary server ID from the host server property value 110A [see FIG. 4]. This object property ID is appended by the LAN detector module 322 to network control protocol packets. These packets are sent intermittently by the network control protocol module 308 [see FIG. 7] on the backup server to the primary server to determine when the primary server fails. In process 404 the backup server monitors, across connection 250 [see FIGS. 5A-E], the drive pulse discussed above in connection with process 400. These pulses can be used to determine when the connection from the primary server to the storage device has failed. Control then passes to decision process 406.




In decision process 406, a determination is made as to whether, on the basis of LAN communications, the primary server has failed. In the event this determination is in the negative, control returns to processes 402 and 404. Alternately, if this determination is in the affirmative, i.e., the primary server is no longer responding to the secondary server's NCP packets, then control is passed to decision process 408. In decision process 408, a determination is made as to whether the drive pulse from the primary is still being received by the secondary server across connection 250. If a determination is made that the communication between the primary server and the storage device 80 has not failed, i.e., that the drive monitor is still detecting drive pulses from the primary, then control returns to processes 402 and 404. This secondary drive detection assures that a momentary LAN failure will not result in the determination that the primary server has failed when in fact that primary server is still communicating with the resource/object such as storage device 80 [see FIGS. 5A-E]. In the alternative, if a determination is reached in decision process 408 that the primary server is no longer communicating with the resource/object, then control is passed to process 410. In process 410 the user is notified of the failure of the primary server. The notification occurs through the cooperative operation of modules 328, 322 and 308 discussed above in connection with FIG. 7. Control is then passed to process 412. In process 412 the secondary server activates the object and passes control to process 414. In process 414 the secondary server mounts the object, i.e., physically assumes control over the object. Control is then passed to process 416 in which the secondary server writes into the host server property value 110A the value for its ID in place of the primary server ID. This new property value is then replicated across all enhanced network directory databases on all the servers in the enterprise. Thus, a failure has been detected and, transparently to the user, an alternate path for communications between workstations and the object, e.g. storage device 80, has been established through the secondary server, e.g. server 54 [see FIGS. 5A-E]. Control then passes to process 418, in which the object is reserved by the backup server.
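Processes 402-418 can be summarized in a single sketch. The lan_heartbeat_ok and drive_pulse_ok callables stand in for the NCP heartbeat of module 322 and the drive pulse monitor of module 326, and notify_user, activate, mount, reserve and propagate stand in for the user notification, Vol-Lib and replication operations described above; none of these names is taken from the actual product, and the sketch shows only the decision sequence.

    # Sketch of the backup server's detection and fail-over branch of FIG. 8B.
    def backup_monitor(record, my_server_id, lan_heartbeat_ok, drive_pulse_ok,
                       notify_user, activate, mount, reserve, propagate):
        primary = record["host server"]
        if lan_heartbeat_ok(primary):
            return "primary alive"              # decision 406: NCP packets still acknowledged
        if drive_pulse_ok(record):
            return "lan glitch only"            # decision 408: primary still pulsing the device
        # Both checks failed: treat the primary as down and take over (processes 410-418).
        notify_user("primary server %s failed; failing over %s"
                    % (primary, record["resource name"]))          # process 410
        activate(record)                        # process 412
        mount(record)                           # process 414: physically assume control
        record["host server"] = my_server_id    # process 416
        propagate(record["resource name"], "host server", my_server_id)
        reserve(record)                         # process 418
        return "fail-over complete"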




Although in the example shown in FIGS. 5A-E the object backed up is a storage device, the invention can be applied with equal benefit to many objects including, but not limited to: a printer, a print queue, a directory and a database.





FIG. 8C details the recovery and fail-back processes on the servers which have a primary and backup relationship with respect to a specific object being processed. The server which has a backup relationship initiates the recovery fail-back process at splice block D. Control then passes to process 458 in which the backup server initiates a LAN heartbeat to enable it to determine whether the primary server has resumed normal operation. This LAN beat was discussed above in connection with process 402 [see FIG. 8B]. Control is then passed to decision process 460. In decision process 460 a determination is made on the basis of the LAN beat as to whether or not the primary server has recovered. If this determination is in the negative, then control returns to process 458. Alternately, if the determination is made in the affirmative, i.e., that the primary has recovered, then control passes to decision process 462.




In decision process 462, a determination is made as to whether the autorecover property value 218A [see FIG. 4] is enabled, i.e., boolean True. In the event this determination is in the negative, then control is passed to process 464. In process 464, the user or network administrator is prompted with the news of a recovery and a request for direction as to whether to initiate fail-back. Control is then passed to decision process 466. In decision process 466 a determination is made as to whether the user response was in the affirmative. In the event that determination is in the negative, control returns to process 464. Alternately, if that determination is in the affirmative, i.e., the user has indicated that fail-back is appropriate, then control passes to process 468. Alternately, if in decision process 462 a determination is made in the affirmative, i.e., that autorecovery has been enabled, then control also passes to process 468. In process 468, the backup server dismounts the object. An object dismount is accomplished by the backup server through the cooperative interaction of data manager module 328 and Vol-Lib module 312 [see FIG. 7]. Control then passes to process 470. In process 470, the backup server deactivates the object. Control is then passed to process 472 in which the object is released. Control then passes to splice block B.
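The backup branch of FIG. 8C reduces to the following sketch. The prompt_user, dismount, deactivate, release and lan_heartbeat_ok callables are hypothetical placeholders for the user interface, Vol-Lib and heartbeat operations described above; the sketch only mirrors the decision points 460-466 and processes 468-472.

    # Sketch of the backup server's recovery-detection and fail-back branch of FIG. 8C.
    def backup_fail_back(record, lan_heartbeat_ok, prompt_user, dismount, deactivate, release):
        primary = record["primary server"]
        if not lan_heartbeat_ok(primary):
            return "primary still down"              # decision 460
        if not record.get("auto-recover", False):    # decision 462
            if not prompt_user("%s has recovered; fail back %s?"
                               % (primary, record["resource name"])):
                return "fail-back deferred"          # decision 466: operator declined
        dismount(record)                             # process 468
        deactivate(record)                           # process 470
        release(record)                              # process 472: the primary may now reclaim it
        return "released"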




The processes for recovery and fail-back as performed on a server designated as primary with respect to a specific object being processed commence at splice block C. Control then passes to decision process 444. In decision process 444 a primary reboot passes control to process 446. In process 446 the data manager module acting through the vol-lib module [see FIG. 7] obtains all objects which have a primary server ID corresponding to the server running the process. Control is then passed to decision process 450. In decision process 450 objects which are reserved are bypassed by passing control to process 448 for selection of the next object. As to those objects which are not reserved, control passes to process 452. In process 452 the object is activated. Control is then passed to process 454 in which the object is mounted. Control is then passed to process 456 in which the primary server modifies the host server property value 110A [see FIG. 4] with respect to that object and writes its own ID into the host server property value. Control is then passed to splice block A.
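The corresponding primary branch can be sketched as follows. Again, is_reserved, activate, mount and propagate are hypothetical stand-ins for the media manager, Vol-Lib and replication calls, and the directory is the dictionary-of-records form used in the earlier sketches; the prior state value written here is one of the states listed in connection with FIG. 4.

    # Sketch of the primary server's reclaim branch of FIG. 8C, run after a reboot.
    def primary_reclaim(my_server_id, directory, is_reserved, activate, mount, propagate):
        for name, record in directory.items():       # process 446: objects this server is primary for
            if record["primary server"] != my_server_id:
                continue
            if is_reserved(record):                  # decision 450: still held by the backup; skip
                continue
            activate(record)                         # process 452
            mount(record)                            # process 454
            # Process 456: take back the host role and replicate the change.
            record["host server"] = my_server_id
            record["prior state"] = "fail-back complete"
            propagate(name, "host server", my_server_id)
            propagate(name, "prior state", "fail-back complete")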




The foregoing description of a preferred embodiment of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously many modifications and variations will be apparent to practitioners skilled in this art. It is intended that the scope of the invention be defined by the following claims and their equivalents.




Glossary




Fail-Over refers to the process of passing data flow control from a failed primary device to a substitute or backup device.




Fail-Back refers to the process of re-passing data flow control from a substitute or backup device to a primary device which has resumed operation.




Appendix A




Incorporation by Reference of Commonly Owned Applications




The following patent applications, commonly owned and filed Oct. 1, 1997, are hereby incorporated herein in their entirety by reference thereto:


















Title | Application No. | Attorney Docket No.
“System Architecture for Remote Access and Control of Environmental Management” | 08/942,160 | MNFRAME.002A1
“Method of Remote Access and Control of Environmental Management” | 08/942,215 | MNFRAME.002A2
“System for Independent Powering of Diagnostic Processes on a Computer System” | 08/942,410 | MNFRAME.002A3
“Method of Independent Powering of Diagnostic Processes on a Computer System” | 08/942,320 | MNFRAME.002A4
“Diagnostic and Managing Distributed Processor System” | 08/942,402 | MNFRAME.005A1
“Method for Managing a Distributed Processor System” | 08/942,448 | MNFRAME.005A2
“System for Mapping Environmental Resources to Memory for Program Access” | 08/942,222 | MNFRAME.005A3
“Method for Mapping Environmental Resources to Memory for Program Access” | 08/942,214 | MNFRAME.005A4
“Hot Add of Devices Software Architecture” | 08/942,309 | MNFRAME.006A1
“Method for The Hot Add of Devices” | 08/942,306 | MNFRAME.006A2
“Hot Swap of Devices Software Architecture” | 08/942,311 | MNFRAME.006A3
“Method for The Hot Swap of Devices” | 08/942,457 | MNFRAME.006A4
“Method for the Hot Add of a Network Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/943,072 | MNFRAME.006A5
“Method for the Hot Add of a Mass Storage Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,069 | MNFRAME.006A6
“Method for the Hot Add of a Network Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,465 | MNFRAME.006A7
“Method for the Hot Add of a Mass Storage Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/962,963 | MNFRAME.006A8
“Method for the Hot Swap of a Network Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/943,078 | MNFRAME.006A9
“Method for the Hot Swap of a Mass Storage Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,336 | MNFRAME.006A-10
“Method for the Hot Swap of a Network Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,459 | MNFRAME.006A-11
“Method for the Hot Swap of a Mass Storage Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/942,458 | MNFRAME.006A-12
“Method of Performing an Extensive Diagnostic Test in Conjunction with a BIOS Test Routine” | 08/942/463 | MNFRAME.008A
“Apparatus for Performing an Extensive Diagnostic Test in Conjunction with a BIOS Test Routine” | 08/942,463 | MNFRAME.009A
“Configuration Management Method for Hot Adding and Hot Replacing Devices” | 08/941,268 | MNFRAME.010A
“Configuration Management System for Hot Adding and Hot Replacing Devices” | 08/942,408 | MNFRAME.011A
“Apparatus for Interfacing Buses” | 08/942,382 | MNFRAME.012A
“Method for Interfacing Buses” | 08/942,413 | MNFRAME.013A
“Computer Fan Speed Control Device” | 08/942,447 | MNFRAME.016A
“Computer Fan Speed Control Method” | 08/942,216 | MNFRAME.017A
“System for Powering Up and Powering Down a Server” | 08/943,076 | MNFRAME.018A
“Method for Powering Up and Powering Down a Server” | 08/943,077 | MNFRAME.019A
“System for Resetting a Server” | 08/942,333 | MNFRAME.020A
“Method for Resetting a Server” | 08/942,405 | MNFRAME.021A
“System for Displaying Flight Recorder” | 08/942,070 | MNFRAME.022A
“Method for Displaying Flight Recorder” | 08/942,068 | MNFRAME.023A
“Synchronous Communication Interface” | 08/943,355 | MNFRAME.024A
“Synchronous Communication Emulation” | 08/942,004 | MNFRAME.025A
“Software System Facilitating the Replacement or Insertion of Devices in a Computer System” | 08/942,317 | MNFRAME.026A
“Method for Facilitating the Replacement or Insertion of Devices in a Computer System” | 08/942,316 | MNFRAME.027A
“System Management Graphical User Interface” | 08/943,357 | MNFRAME.028A
“Display of System Information” | 08/942,195 | MNFRAME.029A
“Data Management System Supporting Hot Plug Operations on a Computer” | 08/942,129 | MNFRAME.030A
“Data Management Method Supporting Hot Plug Operations on a Computer” | 08/942,124 | MNFRAME.031A
“Alert Configurator and Manager” | 08/942,005 | MNFRAME.032A
“Managing Computer System Alerts” | 08/943,356 | MNFRAME.033A
“Computer Fan Speed Control System” | 08/940,301 | MNFRAME.034A
“Computer Fan Speed Control System Method” | 08/941,267 | MNFRAME.035A
“Black Box Recorder for Information System Events” | 08/942,381 | MNFRAME.036A
“Method of Recording Information System Events” | 08/942,164 | MNFRAME.037A
“Method for Automatically Reporting a System Failure in a Server” | 08/942,168 | MNFRAME.040A
“System for Automatically Reporting a System Failure in a Server” | 08/942,384 | MNFRAME.041A
“Expansion of PCI Bus Loading Capacity” | 08/942,404 | MNFRAME.042A
“Method for Expanding of PCI Bus Loading Capacity” | 08/942,223 | MNFRAME.043A
“System for Displaying System Status” | 08/942,347 | MNFRAME.044A
“Method of Displaying System Status” | 08/942,071 | MNFRAME.045A
“Fault Tolerant Computer System” | 08/942,194 | MNFRAME.046A
“Method for Hot Swapping of Network Components” | 08/943,044 | MNFRAME.047A
“A Method for Communicating a Software Generated Pulse Waveform Between Two Servers in a Network” | 08/942,221 | MNFRAME.048A






“A System for Communicating a




08/942,409




MNFRAME.049A






Software Generated Pulse Waveform






Between Two Servers in a Network”






“Method for Clustering Software




08/942,318




MNFRAME.050A






Applications”






“System for Clustering Software




08/942,411




MNFRAME.051A






Applications”






“Method for Automatically




08/942,319




MNFRAME.052A






Configuring a Server after Hot






Add of a Device”






“System for Automatically




08/942,331




MNFRAME.053A






Configuring a Server after Hot






Add of a Device”






“Method of Automatically




08/942,412




MNFRAME.054A






Configuring and Formatting a






Computer System and Installing






Software”






“System for Automatically




08/941,955




MNFRAME.055A






Configuring and Formatting a






Computer System and Installing






Software”






“Determining Slot Numbers in




08/942,462




MNFRAME.056A






a Computer”






“System for Detecting Errors




08/942,169




MNFRAME.058A






in a Network”






“Method of Detecting Errors




08/940,302




MNFRAME.059A






in a Network”






“System for Detecting Network




08/942,407




MNFRAME.060A






Errors”






“Method of Detecting Network




08/942,573




MNFRAME.061A






Errors”













Claims
  • 1. A method for fault tolerant access to a network resource, on a network with a client workstation and a first and a second server, said method for fault tolerant access comprising the acts of: selecting a first server to provide communications between a client workstation and a network resource; detecting a failure of the first server, comprising the acts of: monitoring across a common bus, at a second server, communications between the first server and the network resource across the common bus by noting a continual change in state of the network resource, and observing a termination in the communications between the first server and the network resource across the common bus by noting a stop in the continual change in state of the network resource; and routing communications between the client workstation and the network resource via the second server.
  • 2. The method for fault tolerant access to a network resource of claim 1, further comprising the acts of: identifying in a first record, the primary server for the network resource as the first server; discovering a recovery of the first server; and re-routing communications between the client workstation and the network resource via the first server.
  • 3. The method for fault tolerant access to a network resource of claim 2, further comprising: providing a network resource database; and replicating the network resource database on the first and the second servers.
  • 4. The method for fault tolerant access to a network resource of claim 2, wherein said act of discovering a recovery of the first server, further includes the acts of: sending packets intermittently from the second server to the first server; and re-acquiring acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act and to the recovery of said first server.
  • 5. The method for fault tolerant access to a network resource of claim 1, further comprising: choosing the first server as the primary server and the second server as the backup server, for the network resource; and storing in a first field of the first record the primary server for the network resource and storing in a second field of the first record the backup server for the network resource.
  • 6. The method for fault tolerant access to a network resource of claim 5, wherein said choosing act, includes the act of: allowing a network administrator to select the primary and the backup server.
  • 7. The method for fault tolerant access to a network resource of claim 5, wherein said act of detecting a failure of the first server, further includes the acts of: reading the second field in the first record of the network resource database; determining on the basis of said reading act that the second field identifies the backup server for the network resource as the second server; activating the monitoring by the second server of the first server, in response to said determining act; and ascertaining at the second server a failure of the first server.
  • 8. The method for fault tolerant access to a network resource of claim 7, wherein said act of ascertaining at the second server a failure of the first server, further includes the acts of: sending packets intermittently from the second server to the first server; receiving acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act; and noticing a termination in the receipt of acknowledgments from the first server.
  • 9. The method for fault tolerant access to a network resource of claim 5, wherein said act of recognizing the backup server for the network resource, further includes the acts of: reading the second field in the first record of the network resource database; and determining that the second field identifies the backup server for the network resource as the second server.
  • 10. A program storage device encoding instructions for: causing a computer to provide a network resource database, the database including individual records corresponding to network resources, and the network resource database including a first record corresponding to the network resource and the first record identifying a primary server for the network resource as a first server; causing a computer to select, on the basis of the first record, the first server to provide communications between a client workstation and the network resource; causing a computer to recognize the backup server for the network resource as the second server; causing a computer to detect a failure of the first server, including: causing a computer to monitor across a common bus, at the second server, communications between the first server and the network resource across the common bus by noting a continual change in state of the network resource, and causing a computer to observe a termination in the communications between the first server and the network resource across the common bus by noting a stop in the continual change in state of the network resource; and causing a computer to route communications between the client workstation and the network resource via the second server, responsive to said recognizing and detecting acts.
  • 11. The program storage device of claim 10, further comprising instructions for: causing a computer to identify in the first record, the primary server for the network resource as the first server; causing a computer to discover a recovery of the first server; and causing a computer to re-route communications between the client workstation and the network resource via the first server, responsive to said identifying and discovering acts.
  • 12. The program storage device of claim 11, further including instructions for: causing a computer to replicate a network resource database on the first and the second server.
  • 13. The program storage device of claim 11, wherein said instructions for causing a computer to discover a recovery of the first server, further include instructions for: causing a computer to send packets intermittently from the second server to the first server; and causing a computer to acquire acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act and to the recovery of the first server.
  • 14. The program storage device of claim 13, further including instructions for: causing a computer to choose the first server as the primary server and the second server as the backup server, for the network resource; and causing a computer to store in a first field of the first record the primary server for the network resource and to store in a second field of the first record the backup server for the network resource.
  • 15. The program storage device of claim 14, wherein said instructions for causing a computer to choose, further include: causing a computer to allow a network administrator to select the primary and the backup server.
  • 16. The program storage device of claim 14, wherein said instructions for causing a computer to detect a failure of the first server further include instructions for: causing a computer to read the second field in the first record of the network resource database; causing a computer to determine on the basis of said reading act that the second field identifies the backup server for the network resource as the second server; causing a computer to activate the monitoring by the second server of the first server, in response to said determining act; and causing a computer to ascertain at the second server a failure of the first server.
  • 17. The program storage device of claim 16, wherein said instructions for causing a computer to ascertain at the second server a failure of the first server, further include instructions for: causing a computer to send packets intermittently from the second server to the first server; causing a computer to receive acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act; and causing a computer to notice a termination in the receipt of acknowledgments from the first server.
  • 18. The program storage device of claim 14, wherein said instructions for causing a computer to recognize the backup server for the network resource, further include instructions for: causing a computer to read the second field in the first record of the network resource database; and causing a computer to determine on the basis of said reading act that the second field identifies the backup server for the network resource as the second server.
  • 19. A method for providing fault tolerant access to a network resource, on a network with a client workstation and a first and a second server and a network resource database, wherein the network resource database includes a first record corresponding to a network resource and the first record includes a first field containing the name of the network resource and a second field containing the host server affiliation of the network resource; said method for fault tolerant access comprising the acts of: expanding the network resource database to include a third field for naming the primary server affiliation for the network resource and a fourth field for naming the backup server affiliation for the network resource; naming the first server in the third field; selecting, on the basis of the first record, the first server to provide communications between the client workstation and the network resource; naming the second server in the fourth field; recognizing, on the basis of the fourth field of the first record, the backup server for the network resource as the second server; detecting a failure of the first server, including the acts of: monitoring across a common bus, at the second server, communications between the first server and the network resource across the common bus by noting a continual change in state of the network resource, and observing a termination in the communications between the first server and the network resource across the common bus by noting a stop in the continual change in state of the network resource; and routing communications between the client workstation and the network resource via the second server, responsive to said recognizing and detecting acts.
  • 20. The method for fault tolerant access to a network resource of claim 19, further comprising the acts of: monitoring the server named in the third field; discovering a recovery of the server named in the third field; and re-routing communications between the client workstation and the network resource via the first server, responsive to said monitoring and discovering acts.
  • 21. The method for fault tolerant access to a network resource of claim 20, wherein said act of discovering a recovery of the first server, further includes the acts of: sending packets intermittently from the second server to the first server; and re-acquiring acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act and to the recovery of said first server.
  • 22. The method for fault tolerant access to a network resource of claim 19, wherein said naming acts, include the acts of: allowing a network administrator to name the primary server affiliation and the backup server affiliation in the third and fourth fields of the first record of the network resource database.
  • 23. The method for fault tolerant access to a network resource of claim 19, wherein said act of detecting a failure of the first server, further includes the acts of: sending packets intermittently from the second server to the first server; receiving acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act; and noticing a termination in the receipt of acknowledgments from the first server.
  • 24. A computer usable medium having computer readable program code means embodied therein for causing fault tolerant access to a network resource on a network with a client workstation and a first and second server, and a network resource database, wherein the network resource database includes a first record corresponding to a network resource and the first record includes a first field containing the name of the network resource and a second field containing the host server affiliation of the network resource; the computer readable program code means in said article of manufacture comprising: computer readable program code means for causing a computer to expand the network resource database to include a third field for naming the primary server affiliation for the network resource and a fourth field for naming the backup server affiliation for the network resource; computer readable program code means for causing a computer to name the first server in the third field; computer readable program code means for causing a computer to select, on the basis of the first record, the first server to provide communications between the client workstation and the network resource; computer readable program code means for causing a computer to name the second server in the fourth field; computer readable program code means for causing a computer to recognize, on the basis of the fourth field of the first record, the backup server for the network resource as the second server; computer readable program code means for monitoring across a common bus, at a second server, communications between the first server and the network resource across the common bus by noting a continual change in state of the network resource, and observing a termination in the communications between the first server and the network resource across the common bus by noting a stop in the continual change in state of the network resource; and computer readable program code means for causing a computer to route communications between the client workstation and the network resource via the second server, responsive to said recognizing and detecting acts.
  • 25. The computer readable program code means in said article of manufacture of claim 24, further comprising: computer readable program code means for causing a computer to monitor the server named in the third field; computer readable program code means for causing a computer to discover a recovery of the server named in the third field; and computer readable program code means for causing a computer to re-route communications between the client workstation and the network resource via the first server, responsive to said monitoring and discovering acts.
  • 26. The computer readable program code means in said article of manufacture of claim 25, wherein said computer readable program code means for causing a computer to discover a recovery, further includes: computer readable program code means for causing a computer to send packets intermittently from the second server to the first server; and computer readable program code means for causing a computer to re-acquire acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act and to the recovery of said first server.
  • 27. The computer readable program code means in said article of manufacture of claim 24, wherein said computer readable program code means for causing a computer to name, further includes: computer readable program code means for causing a computer to allow a network administrator to name the primary server affiliation and the backup server affiliation in the third and fourth fields of the first record of the network resource database.
  • 28. The computer readable program code means in said article of manufacture of claim 24, wherein said computer readable program code means for causing a computer to detect a failure, further includes: computer readable program code means for causing a computer to send packets intermittently from the second server to the first server; computer readable program code means for causing a computer to receive acknowledgments from the first server at the second server, the acknowledgments responsive to said sending act; and computer readable program code means for causing a computer to notice a termination in the receipt of acknowledgments from the first server.
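
For readers tracing the detection and remapping acts recited in claims 1, 7 and 8, the sketch below summarizes them informally: the backup server samples the resource's state across the common bus, treats a stop in the continual change of state as a primary failure, and then remaps the resource record so clients are routed through the backup server. This is a hedged illustration only, not the claimed implementation; read_resource_state, the polling constants, and the directory layout are all hypothetical.

# Illustrative sketch of failure detection and remapping; every name and
# constant here is hypothetical.
import time

POLL_INTERVAL = 0.2   # seconds between state samples (assumed value)
STALL_SAMPLES = 5     # unchanged samples before the primary is presumed failed

def read_resource_state(resource: str) -> int:
    # Placeholder for sampling the resource's state across the common bus.
    # A live resource driven by the primary would return a continually
    # changing value; this stub always returns 0 so the sketch runs and
    # immediately "detects" a failure.
    return 0

def primary_has_failed(resource: str) -> bool:
    # Note a stop in the continual change in state of the network resource.
    last = read_resource_state(resource)
    stalled = 0
    while stalled < STALL_SAMPLES:
        time.sleep(POLL_INTERVAL)
        current = read_resource_state(resource)
        stalled = stalled + 1 if current == last else 0
        last = current
    return True

def remap_to_backup(directory: dict, resource: str) -> None:
    # Route client traffic through the backup server by updating the
    # resource's record in the replicated network resource database.
    record = directory[resource]
    record["active"] = record["backup"]

if __name__ == "__main__":
    directory = {"VOL1": {"primary": "SERVER_A", "backup": "SERVER_B", "active": "SERVER_A"}}
    if primary_has_failed("VOL1"):
        remap_to_backup(directory, "VOL1")
    print(directory["VOL1"]["active"])   # prints SERVER_B after fail-over
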
PRIORITY

The benefit under 35 U.S.C. §119(e) of the following U.S. Provisional Application entitled “Clustering Of Computer Systems Using Uniform Object Naming And Distributed Software For Locating Objects,” application Ser. No. 60/046,327, filed on May 13, 1997, is hereby claimed.

US Referenced Citations (309)
Number Name Date Kind
4057847 Lowell et al. Nov 1977
4100597 Fleming et al. Jul 1978
4449182 Rubinson et al. May 1984
4672535 Katzman et al. Jun 1987
4692918 Elliott et al. Sep 1987
4695946 Andreasen et al. Sep 1987
4707803 Anthony, Jr. et al. Nov 1987
4769764 Levanon Sep 1988
4774502 Kimura Sep 1988
4821180 Gerety et al. Apr 1989
4835737 Herrig et al. May 1989
4894792 Mitchell et al. Jan 1990
4949245 Martin et al. Aug 1990
4999787 McNally et al. Mar 1991
5006961 Monico Apr 1991
5007431 Donehoo, III Apr 1991
5033048 Pierce et al. Jul 1991
5051720 Kittirutsunetorn Sep 1991
5073932 Yossifor et al. Dec 1991
5103391 Barrett Apr 1992
5118970 Olson et al. Jun 1992
5121500 Arlington et al. Jun 1992
5123017 Simpkins et al. Jun 1992
5136708 Lapourtre et al. Aug 1992
5136715 Hirose et al. Aug 1992
5138619 Fasang et al. Aug 1992
5157663 Major et al. Oct 1992
5210855 Bartol May 1993
5222897 Collins et al. Jun 1993
5245615 Treu Sep 1993
5247683 Holmes et al. Sep 1993
5253348 Scalise Oct 1993
5261094 Everson et al. Nov 1993
5265098 Mattson et al. Nov 1993
5266838 Gerner Nov 1993
5269011 Yanai et al. Dec 1993
5272382 Heald et al. Dec 1993
5272584 Austruy et al. Dec 1993
5276814 Bourke et al. Jan 1994
5276863 Heider Jan 1994
5277615 Hastings et al. Jan 1994
5280621 Barnes et al. Jan 1994
5283905 Saadeh et al. Feb 1994
5307354 Cramer et al. Apr 1994
5311397 Harshberger et al. May 1994
5311451 Barrett May 1994
5317693 Cuenod et al. May 1994
5329625 Kannan et al. Jul 1994
5337413 Lui et al. Aug 1994
5351276 Doll, Jr. et al. Sep 1994
5367670 Ward et al. Nov 1994
5379184 Barraza et al. Jan 1995
5379409 Ishikawa Jan 1995
5386567 Lien et al. Jan 1995
5388267 Chan et al. Feb 1995
5402431 Saadeh et al. Mar 1995
5404494 Garney Apr 1995
5423025 Goldman et al. Jun 1995
5430717 Fowler et al. Jul 1995
5430845 Rimmer et al. Jul 1995
5432715 Shigematsu et al. Jul 1995
5432946 Allard et al. Jul 1995
5438678 Smith Aug 1995
5440748 Sekine et al. Aug 1995
5448723 Rowett Sep 1995
5455933 Schieve et al. Oct 1995
5460441 Hastings et al. Oct 1995
5463768 Schieve et al. Oct 1995
5465349 Geronimi et al. Nov 1995
5471617 Farrand et al. Nov 1995
5471634 Giorgio et al. Nov 1995
5473499 Weir Dec 1995
5483419 Kaczeus, Sr. et al. Jan 1996
5485550 Dalton Jan 1996
5485607 Lomet et al. Jan 1996
5487148 Komori et al. Jan 1996
5491791 Glowny et al. Feb 1996
5493574 McKinley Feb 1996
5493666 Fitch Feb 1996
5513314 Kandasamy et al. Apr 1996
5513339 Agrawal et al. Apr 1996
5515515 Kennedy et al. May 1996
5517646 Piccirillo et al. May 1996
5519851 Bender et al. May 1996
5526289 Dinh et al. Jun 1996
5528409 Cucci et al. Jun 1996
5530810 Bowman Jun 1996
5533193 Roscoe Jul 1996
5533198 Thorson Jul 1996
5535326 Baskey et al. Jul 1996
5539883 Allon et al. Jul 1996
5542055 Amini et al. Jul 1996
5546272 Moss et al. Aug 1996
5548712 Larson et al. Aug 1996
5555510 Verseput et al. Sep 1996
5559764 Chen et al. Sep 1996
5559958 Farrand et al. Sep 1996
5559965 Oztaskin et al. Sep 1996
5560022 Dunstan et al. Sep 1996
5564024 Pemberton Oct 1996
5566299 Billings et al. Oct 1996
5566339 Perholtz et al. Oct 1996
5568610 Brown Oct 1996
5568619 Blackledge et al. Oct 1996
5572403 Mills Nov 1996
5577205 Hwang et al. Nov 1996
5579487 Meyerson et al. Nov 1996
5579491 Jeffries et al. Nov 1996
5579528 Register Nov 1996
5581712 Herrman Dec 1996
5581714 Amini et al. Dec 1996
5584030 Husak et al. Dec 1996
5586250 Carbonneau et al. Dec 1996
5588121 Reddin et al. Dec 1996
5588144 Inoue et al. Dec 1996
5592610 Chittor Jan 1997
5592611 Midgely et al. Jan 1997
5596711 Burckhartt et al. Jan 1997
5598407 Bud et al. Jan 1997
5602758 Lincoln et al. Feb 1997
5604873 Fite et al. Feb 1997
5606672 Wade Feb 1997
5608865 Midgely et al. Mar 1997
5608876 Cohen et al. Mar 1997
5615207 Gephardt et al. Mar 1997
5621159 Brown et al. Apr 1997
5621892 Cook Apr 1997
5622221 Genga, Jr. et al. Apr 1997
5625238 Ady et al. Apr 1997
5627962 Goodrum et al. May 1997
5628028 Michelson May 1997
5630076 Saulpaugh et al. May 1997
5631847 Kikinis May 1997
5632021 Jennings et al. May 1997
5636341 Matsushita et al. Jun 1997
5638289 Yamada et al. Jun 1997
5644470 Benedict et al. Jul 1997
5644731 Liencres et al. Jul 1997
5651006 Fujino et al. Jul 1997
5652832 Kane et al. Jul 1997
5652833 Takizawa et al. Jul 1997
5652839 Giorgio et al. Jul 1997
5652892 Ugajin Jul 1997
5652908 Douglas et al. Jul 1997
5655081 Bonnell et al. Aug 1997
5655083 Bagley Aug 1997
5655148 Richman et al. Aug 1997
5659682 Devarakonda et al. Aug 1997
5664118 Nishigaki et al. Sep 1997
5664119 Jeffries et al. Sep 1997
5666538 DeNicola Sep 1997
5668943 Attanasio et al. Sep 1997
5668992 Hammer et al. Sep 1997
5669009 Buktenica et al. Sep 1997
5671371 Kondo et al. Sep 1997
5675723 Ekrot et al. Oct 1997
5680288 Carey et al. Oct 1997
5682328 Roeber et al. Oct 1997
5684671 Hobbs et al. Nov 1997
5689637 Johnson et al. Nov 1997
5696895 Hemphill et al. Dec 1997
5696899 Kalwitz Dec 1997
5696949 Young Dec 1997
5696970 Sandage et al. Dec 1997
5701417 Lewis et al. Dec 1997
5704031 Mikami et al. Dec 1997
5708775 Nakamura Jan 1998
5708776 Kikinis Jan 1998
5712754 Sides et al. Jan 1998
5715456 Bennett et al. Feb 1998
5717570 Kikinis Feb 1998
5721935 DeSchepper et al. Feb 1998
5724529 Smith et al. Mar 1998
5726506 Wood Mar 1998
5727207 Gates et al. Mar 1998
5732266 Moore et al. Mar 1998
5737708 Grob et al. Apr 1998
5737747 Vishlitzky et al. Apr 1998
5740378 Rehl et al. Apr 1998
5742514 Bonola Apr 1998
5742833 Dea et al. Apr 1998
5747889 Raynham et al. May 1998
5748426 Bedingfield et al. May 1998
5752164 Jones May 1998
5754396 Felcman et al. May 1998
5754449 Hoshal et al. May 1998
5754797 Takahashi May 1998
5758165 Shuff May 1998
5758352 Reynolds et al. May 1998
5761033 Wilhelm Jun 1998
5761045 Olson et al. Jun 1998
5761085 Giorgio Jun 1998
5761462 Neal et al. Jun 1998
5761707 Aiken et al. Jun 1998
5764924 Hong Jun 1998
5764968 Ninomiya Jun 1998
5765008 Desai et al. Jun 1998
5765198 McCrocklin et al. Jun 1998
5767844 Stoye Jun 1998
5768541 Pan-Ratzlaff Jun 1998
5768542 Enstrom et al. Jun 1998
5771343 Hafner et al. Jun 1998
5774640 Kurio Jun 1998
5774645 Beaujard et al. Jun 1998
5774741 Choi Jun 1998
5777897 Giorgio Jul 1998
5778197 Dunham Jul 1998
5781703 Desai et al. Jul 1998
5781716 Hemphill, II et al. Jul 1998
5781744 Johnson et al. Jul 1998
5781767 Inoue et al. Jul 1998
5781798 Beatty et al. Jul 1998
5784383 Meaney Jul 1998
5784555 Stone Jul 1998
5784576 Guthrie et al. Jul 1998
5787019 Knight et al. Jul 1998
5787459 Stallmo et al. Jul 1998
5787491 Merkin et al. Jul 1998
5790775 Marks et al. Aug 1998
5790831 Lin et al. Aug 1998
5793948 Asahi et al. Aug 1998
5793987 Quackenbush et al. Aug 1998
5794035 Golub et al. Aug 1999
5796185 Takata et al. Aug 1998
5796580 Komatsu et al. Aug 1998
5796934 Bhanot et al. Aug 1998
5796981 Abudayyeh et al. Aug 1998
5797023 Berman et al. Aug 1998
5798828 Thomas et al. Aug 1998
5799036 Staples Aug 1998
5799196 Flannery Aug 1998
5801921 Miller Sep 1998
5802269 Poisner et al. Sep 1998
5802298 Imai et al. Sep 1998
5802305 McKaughan et al. Sep 1998
5802324 Wunderlich et al. Sep 1998
5802393 Begun et al. Sep 1998
5802552 Fandrich et al. Sep 1998
5802592 Chess et al. Sep 1998
5803357 Lakin Sep 1998
5805804 Laursen et al. Sep 1998
5805834 McKinley et al. Sep 1998
5809224 Schultz et al. Sep 1998
5809256 Najemy Sep 1998
5809287 Stupek, Jr. et al. Sep 1998
5809311 Jones Sep 1998
5809555 Hobson Sep 1998
5812748 Ohran et al. Sep 1998
5812750 Dev et al. Sep 1998
5812757 Okamoto et al. Sep 1998
5812858 Nookala et al. Sep 1998
5815117 Kolanek Sep 1998
5815647 Buckland et al. Sep 1998
5815651 Litt Sep 1998
5815652 Ote et al. Sep 1998
5821596 Miu et al. Oct 1998
5822547 Boesch et al. Oct 1998
5826043 Smith et al. Oct 1998
5829046 Tzelnic et al. Oct 1998
5835719 Gibson et al. Nov 1998
5835738 Blackledge, Jr. et al. Nov 1998
5838932 Alzien Nov 1998
5841964 Yamaguchi Nov 1998
5841991 Russell Nov 1998
5845061 Miyamoto et al. Dec 1998
5845095 Reed et al. Dec 1998
5850546 Kim Dec 1998
5852720 Gready et al. Dec 1998
5852724 Glenn, II et al. Dec 1998
5857074 Johnson Jan 1999
5857102 McChesney et al. Jan 1999
5864653 Tavallaei et al. Jan 1999
5864654 Marchant Jan 1999
5864713 Terry Jan 1999
5867730 Leyda Feb 1999
5875307 Ma et al. Feb 1999
5875308 Egan et al. Feb 1999
5875310 Buckland et al. Feb 1999
5878237 Olarig Mar 1999
5878238 Gan et al. Mar 1999
5881311 Woods Mar 1999
5884027 Garbus et al. Mar 1999
5884049 Atkinson Mar 1999
5886424 Kim Mar 1999
5889965 Wallach et al. Mar 1999
5892898 Fujii et al. Apr 1999
5892915 Duso et al. Apr 1999
5892928 Wallach et al. Apr 1999
5893140 Vahalia et al. Apr 1999
5898846 Kelly Apr 1999
5898888 Guthrie et al. Apr 1999
5905867 Giorgio May 1999
5907672 Matze et al. May 1999
5909568 Nason Jun 1999
5911779 Stallmo et al. Jun 1999
5913034 Malcolm Jun 1999
5922060 Goodrum Jul 1999
5930358 Rao Jul 1999
5935262 Barrett et al. Aug 1999
5936960 Stewart Aug 1999
5938751 Tavallaei et al. Aug 1999
5941996 Smith et al. Aug 1999
5964855 Bass et al. Oct 1999
5983349 Kodama et al. Nov 1999
5987554 Liu et al. Nov 1999
5987621 Duso et al. Nov 1999
5987627 Rawlings, III Nov 1999
6012130 Beyda et al. Jan 2000
6038624 Chan et al. Mar 2000
Foreign Referenced Citations (5)
Number Date Country
0 866 403 A1 Sep 1998 EP
04 333 118 A Nov 1992 JP
05 233 110 A Sep 1993 JP
07 093 064 A Apr 1995 JP
07 261 874 A Oct 1995 JP
Non-Patent Literature Citations (25)
Entry
Shanley and Anderson, PCI System Architecture, Third Edition, Chapters 15 & 16, pp. 297-328, Copyright 1995.
PCI Hot-Plug Specification, Preliminary Revision for Review Only, Revision 0.9, pp. i-vi, and 1-25, Mar. 5, 1997.
SES SCSI-3 Enclosure Services, X3T10/Project 1212-D/Rev 8a, pp. i, iii-x, 1-76, and I-1 (index), Jan. 16, 1997.
Compaq Computer Corporation, Technology Brief, pp. 1-13, Dec. 1996, “Where Do I Plug the Cable? Solving the Logical-Physical Slot Numbering Problem.”
ftp.cdrom.com/pub/os2/diskutil/, PHDX software, phdx.zip download, Mar. 1995, “Parallel Hard Disk Xfer.”
Cmasters, Usenet post to microsoft.public.windowsnt.setup, Aug. 1997, “Re: FDISK switches.”
Hildebrand, N., Usenet post to comp.msdos.programmer, May 1995, “Re: Structure of disk partition info.”
Lewis, L., Usenet post to alt.msdos.batch, Apr. 1997, “Re: Need help with automating FDISK and FORMAT.”
Netframe, http://www.netframe-support.com/technology/datasheets/data.htm, before Mar. 1997, “Netframe ClusterSystem 9008 Data Sheet.”
Simos, M., Usenet post to comp.os.msdos.misc, Apr. 1997, “Re: Auto FDISK and FORMAT.”
Wood, M. H., Usenet post to comp.os.netware.misc, Aug. 1996, “Re: Workstation duplication method for WIN95.”
Lyons, Computer Reseller News, Issue 721, pp. 61-62, Feb. 3, 1997, “ACC Releases Low-Cost Solution for ISPs.”
M2 Communications, M2 Presswire, 2 pages, Dec. 19, 1996, “Novell IntranetWare Supports Hot Pluggable PCI from NetFRAME.”
Rigney, PC Magazine, 14(17): 375-379, Oct. 10, 1995, “The One for the Road (Mobile-aware capabilities in Windows 95).”
Shanley, and Anderson, PCI System Architecture, Third Edition, p. 382, Copyright 1995.
Gorlick, M., Conf. Proceedings: ACM/ONR Workshop on Parallel and Distributed Debugging, pp. 175-181, 1991, “The Flight Recorder: An Architectural Aid for System Monitoring.”
IBM Technical Disclosure Bulletin, 92A+62947, pp. 391-394, Oct. 1992, Method for Card Hot Plug Detection and Control.
Davis, T, Usenet post to alt.msdos.programmer, Apr. 1997, “Re: How do I create an FDISK batch file?”
Davis, T., Usenet post to alt.msdos.batch, Apr. 1997, “Re: Need help with automating FDISK and FORMAT . . . ”.
NetFrame Systems Incorporated, Doc. No. 78-1000226-01, pp. 1-2, 5-8, 359-404, and 471-512, Apr. 1996, “NetFrame Clustered Multiprocessing Software: NW0496 CD-ROM for Novell® NetWare® 4.1 SMP, 4.1, and 3.12.”
Shanley, and Anderson, PCI System Architecture, Third Edition, Chapter 15, pp. 297-302, Copyright 1995, “Intro To Configuration Address Space.”
Shanley, and Anderson, PCI System Architecture, Third Edition, Chapter 16, pp. 303-328, Copyright 1995, “Configuration Transactions.”
Sun Microsystems Computer Company, Part No. 802-5355-10, Rev. A, May 1996, “Solstice SyMON User's Guide.”
Sun Microsystems, Part No. 802-6569-11, Release 1.0.1, Nov. 1996, “Remote Systems Diagnostics Installation & User Guide.”
Haban, D. & D. Wybranietz, IEEE Transaction on Software Engineering, 16(2):197-211, Feb. 1990, “A Hybrid Monitor for Behavior and Performance Analysis of Distributed Systems.”
Provisional Applications (1)
Number Date Country
60/046327 May 1997 US