Claims
- 1. A method of writing to cache in a clustered environment comprising:
receiving, in a first node of a storage cluster, a request from a user application to write data;
determining if the data is owned by a remote node;
if the data is owned by the remote node, causing the invalidation of the data in the remote node if necessary;
writing the data in a cache of the first node;
causing the data to be written in a cache of a partner node of the first node; and
receiving, in the first node, a response from the partner node.
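(Illustration only, not claim language.) A minimal Python sketch of the write path recited in claim 1 appears below. Every name in it, such as Node, GlobalCacheDirectory, and write, is invented for the example, and the global cache directory of claim 2 is reduced to a dictionary mapping block addresses to owning nodes.

```python
# Hypothetical sketch of the claim 1 write path; not the patented implementation.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: int
    cache: dict = field(default_factory=dict)   # block address -> data
    partner: Optional["Node"] = None


class GlobalCacheDirectory:
    """Maps each block address to the node holding its most up-to-date copy."""

    def __init__(self):
        self.owner = {}                         # block address -> Node

    def lookup(self, block):
        return self.owner.get(block)

    def set_owner(self, block, node):
        self.owner[block] = node


def write(directory, first_node, block, data):
    """Claim 1 steps: invalidate remotely if needed, write locally, mirror to partner."""
    remote = directory.lookup(block)
    if remote is not None and remote is not first_node:
        remote.cache.pop(block, None)           # invalidate the stale remote copy
    first_node.cache[block] = data              # write into the first node's cache
    first_node.partner.cache[block] = data      # mirror into the partner's cache
    directory.set_owner(block, first_node)
    return "partner-ack"                        # stands in for the partner's response
```

Under this reading, the completion notice of claim 4 would be sent to the user application only after the partner's response arrives, so a write is acknowledged once two cache copies exist.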
- 2. The method of claim 1 wherein the determining utilizes a global cache directory that maintains information on which node contains a most up-to-date copy of data.
- 3. The method of claim 2 wherein an invalidation of the remote node is not necessary.
- 4. The method of claim 1 further comprising notifying the user application of a completion of a write operation.
- 5. The method of claim 1 further comprising utilizing a dynamically adjusted upper bound to determine the amount of space available to store data in the partner node.
- 6. The method of claim 5 further comprising:
the first node observing a read-intensive workload; and decreasing the upper bound.
- 7. The method of claim 5 further comprising:
the first node observing a write-intensive workload; and increasing the upper bound.
- 8. The method of claim 5 further comprising:
determining that the upper bound has been reached; and waiting until data has been flushed to disk prior to writing to the cache of the partner node.
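(Illustration only, not claim language.) One possible policy for the dynamically adjusted upper bound of claims 5 through 8 is sketched below; the thresholds, step size, and names are assumptions, not taken from the application.

```python
# Hypothetical governor for the mirror-space upper bound of claims 5-8.
class MirrorSpaceGovernor:
    def __init__(self, upper_bound_mb=256, step_mb=32,
                 min_bound_mb=64, max_bound_mb=1024):
        self.upper_bound_mb = upper_bound_mb
        self.step_mb = step_mb
        self.min_bound_mb = min_bound_mb
        self.max_bound_mb = max_bound_mb
        self.used_mb = 0

    def observe_workload(self, reads, writes):
        """Claims 6-7: shrink the bound when reads dominate, grow it when writes do."""
        if reads > writes:
            self.upper_bound_mb = max(self.min_bound_mb,
                                      self.upper_bound_mb - self.step_mb)
        elif writes > reads:
            self.upper_bound_mb = min(self.max_bound_mb,
                                      self.upper_bound_mb + self.step_mb)

    def reserve(self, size_mb, flush_to_disk):
        """Claim 8: once the bound is reached, wait for a flush before mirroring more."""
        while self.used_mb + size_mb > self.upper_bound_mb:
            freed_mb = flush_to_disk()   # assumed to block until dirty data is on disk
            self.used_mb = max(0, self.used_mb - freed_mb)
        self.used_mb += size_mb
```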
- 9. The method of claim 1 further comprising:
determining that the first node has crashed; and recovering data using the data stored in the cache of the partner node.
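(Illustration only, not claim language.) A sketch of the recovery step of claim 9 follows, assuming the partner tags each mirrored entry with the owning node's id; that tagging scheme is an assumption made for the example.

```python
# Hypothetical recovery of claim 9: rebuild a crashed node's dirty data from the
# mirrored copies its partner holds. partner_cache maps
# (owner_node_id, block_address) -> data.
def recover_from_partner(partner_cache, crashed_node_id, write_to_disk):
    recovered = 0
    for (owner, block), data in list(partner_cache.items()):
        if owner == crashed_node_id:
            write_to_disk(block, data)          # or promote into a surviving cache
            del partner_cache[(owner, block)]
            recovered += 1
    return recovered
```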
- 10. The method of claim 1 further comprising removing a node by:
ensuring that data in the cache of the first node is safely stored; and establishing an owner-partner relationship between the partner node and a second node for which the first node was a partner.
- 11. The method of claim 10 further comprising:
writing data in the cache of the first node to disk;
causing any new write requests to the first node to be synchronously written to disk;
causing the second node to write data in a cache of the second node to disk;
causing the partner node to remove mirrored cache entries for the first node when the writing of the data in the cache of the first node to disk is complete; and
removing the first node.
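(Illustration only, not claim language.) One plausible ordering of the removal steps of claims 10 and 11 is sketched below; every callable is assumed, and `second` denotes the node for which the departing node served as partner.

```python
# Hypothetical removal sequence for claims 10-11.
def remove_node(first, second, partner, flush, write_through, drop_mirrors):
    write_through(first)   # new writes to the departing node go synchronously to disk
    flush(first)           # write the departing node's dirty cache to disk
    flush(second)          # the second node, which mirrored into `first`, drains too
    drop_mirrors(partner, owned_by=first)   # partner discards now-redundant mirrors
    second.partner = partner                # claim 10: re-pair the orphaned owner
    # the departing node `first` may now be removed from the cluster
```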
- 12. The method of claim 10 further comprising a global cache directory manager ensuring that directory information is consistent with information stored in the cache of the partner node and a cache of the second node, said ensuring comprising:
removing directory entries for mirrored cache in the partner node that are owned by the first node so that subsequent requests can find data from disk, wherein the first node continues to accept invalidation messages until the global cache directory manager ensures consistent directory states;
removing mirrored cache entries in the partner node that are owned by the first node;
removing directory entries that are owned by the first node; and
informing the first node that it may be removed.
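(Illustration only, not claim language.) The ordering constraint of claim 12 can be made concrete as below; `directory` is assumed to map block addresses to owner ids, and `partner_mirrors` to map (owner id, block address) pairs to data.

```python
# Hypothetical directory-manager steps for claim 12, in the claimed order.
def retire_node(directory, partner_mirrors, first_id):
    # 1. Remove directory entries for the departing node's data that is mirrored
    #    in the partner, so subsequent requests fall through to disk.
    for block in [b for b, owner in directory.items()
                  if owner == first_id and (first_id, b) in partner_mirrors]:
        del directory[block]
    # (Until this point the departing node still accepts invalidation messages.)
    # 2. Remove the partner's mirrored cache entries owned by the departing node.
    for key in [k for k in partner_mirrors if k[0] == first_id]:
        del partner_mirrors[key]
    # 3. Remove any remaining directory entries owned by the departing node.
    for block in [b for b, owner in directory.items() if owner == first_id]:
        del directory[block]
    # 4. Only now inform the departing node that it may be removed.
    return True
```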
- 13. The method of claim 10 further comprising:
the first node notifying the partner node of the removal of the first node;
causing the partner node to read mirrored cache data in the first node;
causing the partner node to write the mirrored cache data to the cache of the partner node, wherein the write causes a replication of the data to a cache of a third node; and
removing the first node.
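(Illustration only, not claim language.) Claim 13 avoids a disk flush by migrating the departing node's cached data through the partner's normal write path; a minimal sketch under that reading follows.

```python
# Hypothetical migration of claim 13: the partner reads the departing node's
# mirrored data and rewrites it, which re-replicates it to a third node.
def migrate_via_partner(first_cache, partner_cache, third_cache):
    for block, data in list(first_cache.items()):
        partner_cache[block] = data   # partner rewrites the data as its own
        third_cache[block] = data     # the rewrite replicates to the third node
        del first_cache[block]        # the departing node's copy is now redundant
```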
- 14. The method of claim 10 further comprising:
storing, in a phase number, additional information identifying a node's partner; and determining a node's partner based on an indirect lookup table and the phase number.
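(Illustration only, not claim language.) The phase number and indirect lookup table of claim 14 might be encoded as below; the table contents are invented.

```python
# Hypothetical partner resolution for claim 14: the phase number versions the
# cluster's partner assignments, and an indirection table resolves them.
PARTNER_TABLE = {
    0: {0: 1, 1: 2, 2: 0},   # phase 0: node 0 -> partner 1, and so on
    1: {0: 2, 2: 0},         # phase 1: assignments after node 1 was removed
}


def partner_of(node_id, phase):
    """Determine a node's partner from the lookup table and the phase number."""
    return PARTNER_TABLE[phase][node_id]
```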
- 15. The method of claim 10 further comprising:
receiving a node removal command in the second node;
identifying the partner node as a partner of the second node;
flushing dirty cache from the second node to disk;
flushing dirty cache from the first node to disk;
invalidating entries in a global cache directory based on the flushing;
removing cache entries corresponding to the flushed cache lines from the global cache directory;
notifying the first node when the flushing has been completed in the second node; and
removing the first node.
- 16. The method of claim 15 wherein block addresses of written data are inserted into a hash table that is used to identify data that has been written to disk.
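(Illustration only, not claim language.) The hash table of claim 16 could be as simple as a set of flushed block addresses; Python's built-in set, which is hash-based, stands in for it here.

```python
# Hypothetical flush tracking for claims 15-16.
flushed = set()                      # hash table of block addresses written to disk


def flush_block(block, write_to_disk):
    write_to_disk(block)
    flushed.add(block)               # record that this block now resides on disk


def is_on_disk(block):
    return block in flushed
```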
- 17. The method of claim 1 further comprising causing the data to be asynchronously written to disk.
- 18. An apparatus for writing to cache in a clustered environment comprising:
(a) a cache;
(b) a first storage node and a partner storage node organized in a storage cluster, each storage node having an interface for connecting to a host and a storage disk, wherein each storage node maintains cache, and wherein at least one of the storage nodes is configured to:
(i) receive, from a user application, a request to write data;
(ii) determine if the data is owned by a remote node;
(iii) if the data is owned by the remote node, cause the invalidation of the data in the remote node if necessary;
(iv) write the data in a cache of the first node;
(v) cause the data to be written in a cache of a partner node of the first node; and
(vi) receive, in the first node, a response from the partner node.
- 19. The apparatus of claim 18 further comprising a global cache directory that maintains information on which node contains a most up-to-date copy of data.
- 20. The apparatus of claim 19 wherein an invalidation of the remote node is not necessary.
- 21. The apparatus of claim 18 wherein at least one of the nodes is further configured to notify the user application of a completion of a write operation.
- 22. The apparatus of claim 18 wherein at least one of the nodes is further configured to utilize a dynamically adjusted upper bound to determine the amount of space available to store data in the partner node.
- 23. The apparatus of claim 22 wherein at least one of the nodes is further configured to:
observe a read-intensive workload; and decrease the upper bound.
- 24. The apparatus of claim 22 wherein at least one of the nodes is further configured to:
observe a write-intensive workload; and increase the upper bound.
- 25. The apparatus of claim 22 wherein at least one of the nodes is further configured to:
determine that the upper bound has been reached; and wait until data has been flushed to disk prior to writing to the cache of the partner node.
- 26. The apparatus of claim 18 wherein at least one of the nodes is further configured to:
determine that the first node has crashed; and recover data using the data stored in the cache of the partner node.
- 27. The apparatus of claim 18 wherein at least one of the nodes may be removed and is configured to:
ensure that data in the cache of the first node is safely stored; and establish an owner-partner relationship between the partner node and a second node for which the first node was a partner.
- 28. The apparatus of claim 27 wherein at least one of the nodes is further configured to:
write data in the cache of the first node to disk;
cause any new write requests to the first node to be synchronously written to disk;
cause the second node to write data in a cache of the second node to disk;
cause the partner node to remove mirrored cache entries for the first node when the writing of the data in the cache of the first node to disk is complete; and
remove the first node.
- 29. The apparatus of claim 27 further comprising a global cache directory manager configured to ensure that directory information is consistent with information stored in the cache of the partner node and a cache of the second node, said manager configured to ensure by:
removing directory entries for mirrored cache in the partner node that are owned by the first node so that subsequent requests can find data from disk, wherein the first node continues to accept invalidation messages until the global cache directory manager ensures consistent directory states;
removing mirrored cache entries in the partner node that are owned by the first node;
removing directory entries that are owned by the first node; and
informing the first node that it may be removed.
- 30. The apparatus of claim 27 wherein at least one of the nodes is configured to:
notify the partner node of the removal of the first node;
cause the partner node to read mirrored cache data in the first node;
cause the partner node to write the mirrored cache data to the cache of the partner node, wherein the write causes a replication of the data to a cache of a third node; and
remove the first node.
- 31. The apparatus of claim 27 wherein at least one of the nodes is further configured to:
store, in a phase number, additional information identifying a node's partner; and determine a node's partner based on an indirect lookup table and the phase number.
- 32. The apparatus of claim 27 wherein at least one of the nodes is further configured to:
receive a node removal command in the second node;
identify the partner node as a partner of the second node;
flush dirty cache from the second node to disk;
flush dirty cache from the first node to disk;
invalidate entries in a global cache directory based on the flushing;
remove cache entries corresponding to the flushed cache lines from the global cache directory;
notify the first node when the flushing has been completed in the second node; and
remove the first node.
- 33. The apparatus of claim 32 wherein at least one of the nodes is further configured to insert block addresses of written data into a hash table that is used to identify data that has been written to disk.
- 34. The apparatus of claim 18 wherein at least one of the nodes is further configured to cause the data to be asynchronously written to disk.
- 35. An article of manufacture embodying logic to perform a method of writing to cache in a clustered environment, the method comprising:
receiving, in a first node of a storage cluster, a request from a user application to write data;
determining if the data is owned by a remote node;
if the data is owned by the remote node, causing the invalidation of the data in the remote node if necessary;
writing the data in a cache of the first node;
causing the data to be written in a cache of a partner node of the first node; and
receiving, in the first node, a response from the partner node.
- 36. The article of manufacture of claim 35 wherein the determining utilizes a global cache directory that maintains information on which node contains a most up-to-date copy of data.
- 37. The article of manufacture of claim 36 wherein an invalidation of the remote node is not necessary.
- 38. The article of manufacture of claim 35, the method further comprising notifying the user application of a completion of a write operation.
- 39. The article of manufacture of claim 35, the method further comprising utilizing a dynamically adjusted upper bound to determine the amount of space available to store data in the partner node.
- 40. The article of manufacture of claim 39, the method further comprising:
the first node observing a read-intensive workload; and decreasing the upper bound.
- 41. The article of manufacture of claim 39, the method further comprising:
the first node observing a write-intensive workload; and increasing the upper bound.
- 42. The article of manufacture of claim 39, the method further comprising:
determining that the upper bound has been reached; and waiting until data has been flushed to disk prior to writing to the cache of the partner node.
- 43. The article of manufacture of claim 35, the method further comprising:
determining that the first node has crashed; and recovering data using the data stored in the cache of the partner node.
- 44. The article of manufacture of claim 35, the method further comprising removing a node by:
ensuring that data in the cache of the first node is safely stored; and establishing an owner-partner relationship between the partner node and a second node for which the first node was a partner.
- 45. The article of manufacture of claim 44, the method further comprising:
writing data in the cache of the first node to disk;
causing any new write requests to the first node to be synchronously written to disk;
causing the second node to write data in a cache of the second node to disk;
causing the partner node to remove mirrored cache entries for the first node when the writing of the data in the cache of the first node to disk is complete; and
removing the first node.
- 46. The article of manufacture of claim 44, the method further comprising a global cache directory manager ensuring that directory information is consistent with information stored in the cache of the partner node and a cache of the second node, said ensuring comprising:
removing directory entries for mirrored cache in the partner node that are owned by the first node so that subsequent requests can find data from disk, wherein the first node continues to accept invalidation messages until the global cache directory manager ensures consistent directory states;
removing mirrored cache entries in the partner node that are owned by the first node;
removing directory entries that are owned by the first node; and
informing the first node that it may be removed.
- 47. The article of manufacture of claim 44, the method further comprising:
the first node notifying the partner node of the removal of the first node;
causing the partner node to read mirrored cache data in the first node;
causing the partner node to write the mirrored cache data to the cache of the partner node, wherein the write causes a replication of the data to a cache of a third node; and
removing the first node.
- 48. The article of manufacture of claim 44, the method further comprising:
storing, in a phase number, additional information identifying a node's partner; and determining a node's partner based on an indirect lookup table and the phase number.
- 49. The article of manufacture of claim 44, the method further comprising:
receiving a node removal command in the second node;
identifying the partner node as a partner of the second node;
flushing dirty cache from the second node to disk;
flushing dirty cache from the first node to disk;
invalidating entries in a global cache directory based on the flushing;
removing cache entries corresponding to the flushed cache lines from the global cache directory;
notifying the first node when the flushing has been completed in the second node; and
removing the first node.
- 50. The article of manufacture of claim 49 wherein block addresses of written data are inserted into a hash table that is used to identify data that has been written to disk.
- 51. The article of manufacture of claim 35, the method further comprising causing the data to be asynchronously written to disk.
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is related to the following co-pending and commonly assigned patent applications, which applications are incorporated by reference herein:
[0002] U.S. patent application Ser. No. 09/755,858, entitled “METHOD AND APPARATUS FOR SUPPORTING PARITY PROTECTED RAID IN A CLUSTERED ENVIRONMENT”, by Lawrence Yium-chee Chiu et al., Attorney Docket No. ARC9-2000-0054-US1, filed on Jan. 5, 2001;
[0003] U.S. patent application Ser. No. xx/xxx,xxx, filed on the same date herewith, entitled “METHOD AND APPARATUS FOR A GLOBAL CACHE DIRECTORY IN A STORAGE CLUSTER”, by Lawrence Yium-chee Chiu et al., Attorney Docket No. ARC9-2000-0055-US1; and
[0004] U.S. patent application Ser. No. xx/xxx,xxx, filed on the same date herewith, entitled “METHOD AND APPARATUS FOR CACHE SYNCHRONIZATION IN A CLUSTERED ENVIRONMENT”, by Lawrence Yium-chee Chiu et al., Attorney Docket No. ARC9-2000-0056-US1.