Claims
- 1. A partitioned computer system for containing and handling a packet processing failure in a failed domain, comprising:
a resource definition table for storing a status of at least one allocated resource dynamically shared by at least one domain, each resource associated with a domain ID identifying the domain to which the resource is allocated; and a system manager having write and read access to the resource definition table, the system manager adapted to identify an allocated resource and the failed domain associated with the allocated resource, using the domain ID.
- 2. The system of claim 1, further comprising a plurality of computer nodes coupled via an interconnect, and wherein the system manager is further adapted to enter a quiesce mode for each node in each domain.
- 3. The system of claim 2, wherein the system manager is further adapted to identify at least one non-failed domain and to exit the quiesce mode for the at least one non-failed domain.
- 4. The system of claim 1, wherein the system manager is further adapted to deallocate the allocated resource associated with the failed domain by changing a status of the resource indicated in the resource definition table.
- 5. The system of claim 1, wherein each resource in the resource definition table is associated with a valid bit having a specified value indicating whether the resource is allocated.
- 6. The system of claim 5, wherein the specified value indicates that the resource is allocated in response to the valid bit being zero.
- 7. The system of claim 5, wherein the specified value indicates that the resource is allocated in response to the valid bit being one.
- 8. In a computer system partitioned into at least two domains, each domain having a plurality of computer nodes, a method for containing and handling a packet processing failure in a failed domain, comprising:
entering a quiesce mode for each node in each domain, in response to the packet processing failure in the system; identifying an allocated resource in a resource definition table; identifying the failed domain associated with the allocated resource in the resource definition table; identifying at least one non-failed domain as having no allocated resource associated with it in the resource definition table; exiting the quiesce mode for the non-failed domain; and deallocating the allocated resource associated with the failed domain in the resource definition table.
- 9. The method of claim 8, further comprising a step of resetting the failed domain.
- 10. The method of claim 9, wherein the step of resetting the failed domain further comprises changing a state of the failed domain.
- 11. The method of claim 8, wherein the step of entering the quiesce mode further comprises issuing a lock acquisition request to each node in each domain.
- 12. The method of claim 8, wherein the step of exiting the quiesce mode further comprises issuing a lock release request to each node in each domain.
- 13. The system of claim 2, wherein at least one of the computer nodes is a CPU node.
- 14. The system of claim 2, wherein at least one of the computer nodes is an I/O node.
- 15. The system of claim 2, wherein at least one of the computer nodes is a memory node.
- 16. The system of claim 1, wherein the system manager is implemented as hardware.
- 17. The system of claim 1, wherein the system manager is implemented as software.
- 18. The system of claim 1, wherein the system manager is implemented as software residing on a computer external to the system.
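The fault-containment method of claim 8, together with the resource definition table of claims 1 and 4 to 7 and the lock requests of claims 11 and 12, can be illustrated with a minimal Python sketch. All class, field, and method names are hypothetical. The sketch assumes that a valid bit of 1 marks an allocated resource (the claim 7 convention), models quiesce mode as a per-node lock flag, and treats every entry still allocated after quiescing as belonging to a failed domain. It is a model of the claimed steps, not the patented implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a CPU, I/O, or memory node."""
    name: str
    domain_id: int
    locked: bool = False  # True while the node is held in quiesce mode

    def acquire_lock(self) -> None:
        # Stand-in for a lock acquisition request (claim 11).
        self.locked = True

    def release_lock(self) -> None:
        # Stand-in for a lock release request (claim 12).
        self.locked = False


@dataclass
class ResourceEntry:
    """One row of the resource definition table (hypothetical layout)."""
    valid: int = 0       # assumed convention: 1 = allocated, 0 = free
    domain_id: int = -1  # domain ID of the owner while the entry is valid


@dataclass
class SystemManager:
    nodes: list[Node]
    table: list[ResourceEntry]
    pending_reset: list[int] = field(default_factory=list)

    def handle_packet_failure(self) -> set[int]:
        # Enter quiesce mode for every node in every domain.
        for node in self.nodes:
            node.acquire_lock()

        # Identify allocated entries and, via their domain IDs, the failed domains.
        failed = {e.domain_id for e in self.table if e.valid == 1}

        # A domain with no allocated entry is non-failed: exit quiesce for its nodes.
        for node in self.nodes:
            if node.domain_id not in failed:
                node.release_lock()

        # Deallocate the failed domains' entries by clearing the valid bit.
        for entry in self.table:
            if entry.valid == 1 and entry.domain_id in failed:
                entry.valid = 0

        # Failed domains stay quiesced and are queued for reset (claim 9).
        self.pending_reset.extend(sorted(failed))
        return failed


# Usage: domain 1 left an allocated entry behind, so it is treated as failed.
nodes = [Node("cpu0", 0), Node("io0", 0), Node("cpu1", 1), Node("mem1", 1)]
table = [ResourceEntry(), ResourceEntry(valid=1, domain_id=1), ResourceEntry()]
manager = SystemManager(nodes, table)
failed = manager.handle_packet_failure()
print(failed)                     # {1}
print([n.locked for n in nodes])  # [False, False, True, True]
```

Keeping the failed domain's nodes locked until reset is a choice made by this sketch; the claims do not specify when the failed domain leaves quiesce mode. Adopting the claim 6 convention (valid bit zero means allocated) would only change the value that the comparisons test for.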
RELATED APPLICATION
[0001] This application is a continuation-in-part and claims priority from U.S. patent application Ser. No. 09/861,293 entitled “System and Method for Partitioning a Computer System into Domains” by Kazunori Masuyama, Patrick N. Conway, Hitoshi Oi, Jeremy Farrell, Sudheer Miryala, Yukio Nishimura, Prabhunanadan B. Narasimhamurthy, filed May 17, 2001.
[0002] This application also claims priority under 35 U.S.C. §119(e) to U.S. Provisional Application No. 60/301,969, filed Jun. 29, 2001, and entitled “Fault Containment and Error Handling in a Partitioned System with Shared Resources” by Kazunori Masuyama, Yasushi Umezawa, Jeremy J. Farrell, Sudheer Miryala, Takeshi Shimizu, Hitoshi Oi, and Patrick N. Conway, which is incorporated by reference herein in its entirety.
Provisional Applications (1)

| Number | Date | Country |
|---|---|---|
| 60301969 | Jun 2001 | US |

Continuation in Parts (1)

| Relation | Number | Date | Country |
|---|---|---|---|
| Parent | 09861293 | May 2001 | US |
| Child | 10150618 | May 2002 | US |