The present invention relates generally to sharing data across one or more data storage access nodes on a data storage network, and more particularly to systems and methods for providing directory-based cache coherency across a distributed network of data storage access nodes.
In current storage networks, and in particular storage networks including geographically remote access nodes and storage resources, preserving or reducing bandwidth between resources and access nodes is highly desirable. It is therefore also desirable that data access be localized, in part to improve access speed to pages requested by host devices. Caching pages at access nodes provides localization; however, the cached data must be kept coherent with respect to modifications at other access nodes that may be caching the same data. Current storage network access solutions do not provide viable coherency mechanisms for caching pages locally at storage network access nodes.
Accordingly, it is desirable to provide efficient data localization and cache coherency systems and methods that overcome the above and other problems. Such systems and methods should also provide reduced bandwidth usage, or messaging requirements, between storage network access nodes.
The present invention provides systems and methods for implementing directory-based cache coherency across a distributed network of data storage access nodes.
According to the present invention, a plurality of access nodes sharing access to data on a storage network implement a directory based cache ownership scheme. One node, designated as a global coordinator, maintains a directory (e.g., table or other data structure) storing information about I/O operations by the access nodes. The other nodes send requests to the global coordinator when an I/O operation is to be performed on identified data. Ownership of that data in the directory is given to the first requesting node. Ownership may transfer to another node if the directory entry is unused or quiescent. According to the present invention, the distributed directory-based cache coherency allows for reducing bandwidth requirements between geographically separated access nodes by allowing localized (cached) access to remote data.
According to one aspect of the present invention, a method is provided for reducing the number of messages sent between data access nodes sharing access to a data storage network so as to maintain traffic scalability. The method typically includes maintaining a directory of page ownership entries, wherein ownership of an entry is initially granted to the first access node requesting access to a page in the entry, and wherein ownership of an entry automatically transfers to the node that is accessing pages in the entry more often so as to reduce the number of synchronization messages sent between nodes.
According to another aspect of the present invention, a method is provided for reducing bandwidth between geographically separated access nodes sharing access to data in a data storage network. The method typically includes caching data locally to an access node to provide localized cache access to that data for that node, and maintaining data coherency for cached data between the access nodes using a directory based ownership scheme.
According to yet another aspect of the present invention, a method of providing cache coherence between caches in a distributed set of data access nodes in a data storage network typically includes maintaining a directory in at least one of a plurality of access nodes sharing access to the data storage network, the directory storing information about data accessed by the plurality of access nodes, and receiving, at a first data access node, a data access request from a host system, the data access request identifying data to be processed. The method also typically includes determining whether the first access node has the identified data stored in cache, and if not, determining, using the directory, whether another node in the plurality of access nodes has a copy of the identified data stored in cache, and if a node has a copy of the identified data stored in cache, sending one of a share request to that node to share the identified data so that the requesting node does not have to access the identified data from storage or an invalidate request to invalidate the copy of the data stored in that node's cache.
According to a further aspect of the present invention, a method is provided for reducing the number of messages sent between data access nodes sharing access to a data storage network so as to maintain traffic scalability. The method typically includes maintaining a directory for storing information about data accessed by a plurality of data access nodes, where the directory includes entries representing one or more pages of data in the data storage network, and receiving, at a first data access node, a data access request from a host system, where the data access request identifies data to be processed. The method also typically includes determining, using a global directory coordinator, whether a node has ownership of the directory entry for the identified data, and if no node has ownership of the directory entry, granting to the first access node ownership of the directory entry for the identified data, and if a node has ownership of the entry, identifying that node to the first node. The first node can then communicate with the identified node to process an I/O request.
According to yet a further aspect, a system is provided for maintaining cache coherency between a plurality of data access nodes sharing access to a data storage network. The system typically includes a storage system for storing data, and a plurality of access nodes configured to access data in the storage system in response to host requests. One of the nodes is typically configured to maintain a directory (directory node) for storing information about data accessed by the plurality of data access nodes, where the directory includes entries representing one or more pages of data in the data storage network. In operation, upon receiving a data access request identifying data to be processed from a host, a first access node sends a request to the directory node, wherein the directory node determines whether a node has ownership of the directory entry for the identified data; and if no node has ownership of the directory entry, the directory node grants to the first access node ownership of the directory entry for the identified data, and if a node has ownership of the entry, the directory node identifies that node to the first access node.
Reference to the remaining portions of the specification, including the drawings and claims, will realize other features and advantages of the present invention. Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with respect to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.
According to one embodiment, a Directory Manager module, or DMG, is provided. The DMG is responsible for providing cache coherence mechanisms for shared data across a distributed set of data access nodes. The set of nodes that are caching data from a shared data volume is called a share group. In general, a DMG module includes software executing on a processor or other intelligence module (e.g., ASIC) in a node. A DMG module can be implemented in a single node or distributed across multiple intercommunicating nodes. In certain aspects, an access node is embodied as a controller device, or blade, communicably coupled to a storage network, such as a storage area network (SAN), that allows access to data stored on the storage network. However, it will be appreciated that an access node can also be embodied as an intelligent fabric switch or other network device such as a hub adapter. Because Locality Conscious Directory Migration (LCDM) is applicable to databases, any networked compute node can be configured to operate as an access node with DMG functionality (e.g., a DMG can be run on a desktop computer with a network connection). U.S. Pat. No. 6,148,414, which is incorporated by reference in its entirety, discloses controller devices and nodes for which implementation of aspects of the present invention is particularly useful.
According to the present invention, distributed cache coherence is important for reducing bandwidth requirements between geographically separated access nodes by allowing localized (cached) access to remote data. According to one aspect, data access cannot be localized unless the data can be cached, yet it is unsafe to cache the data unless it can be kept coherent with respect to modifications at remote access nodes. Although any embodiment of the DMG can satisfy the correctness requirements of cache coherence, the high overhead of many implementations can outweigh the benefits of localized cache access. The LCDM embodiment of the present invention discussed below has demonstrated low enough overhead to make localized cache access practical and beneficial.
The base coherence unit in the DMG is a page (a logical block of storage), but the DMG allows for operations at both the sub-page and the multi-page levels. The directory is a collection of directory entries, each encoding distributed sharing knowledge for a specific page. When concurrent cache operations are active on a page, the directory entry locks and synchronizes access to the distributed resource. Directory information is kept current through point-to-point messages sent between the affected nodes. The DMG cache coherence messaging dialog allows it to share pages from remote caches (e.g., when read requests miss in the local cache) and invalidate remote cached copies (e.g., when write requests supersede previous copies).
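By way of illustration only, the share and invalidate behavior described above can be sketched in Python as follows. The class and attribute names (`Cluster`, `sharers`, `storage_reads`) are hypothetical and do not appear in the specification; messages are modeled as direct method calls, locking is omitted, and persistence of written data to back-end storage is not modeled.

```python
class Node:
    """An access node with a local page cache (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.cache = {}  # page_id -> page data


class Cluster:
    """Toy share group: nodes, a sharer directory, and back-end storage."""
    def __init__(self, names, storage):
        self.nodes = {n: Node(n) for n in names}
        self.storage = storage    # page_id -> data, stands in for the storage system
        self.sharers = {}         # directory: page_id -> set of node names with a copy
        self.storage_reads = 0    # counts reads that had to go to storage

    def read(self, node_name, page_id):
        node = self.nodes[node_name]
        if page_id in node.cache:              # local hit: no messages needed
            return node.cache[page_id]
        holders = self.sharers.setdefault(page_id, set())
        if holders:                            # share the page from a remote cache
            donor = self.nodes[next(iter(holders))]
            data = donor.cache[page_id]
        else:                                  # nobody caches it: read from storage
            data = self.storage[page_id]
            self.storage_reads += 1
        node.cache[page_id] = data
        holders.add(node_name)
        return data

    def write(self, node_name, page_id, data):
        holders = self.sharers.setdefault(page_id, set())
        for other in holders - {node_name}:    # invalidate superseded remote copies
            del self.nodes[other].cache[page_id]
        self.nodes[node_name].cache[page_id] = data
        self.sharers[page_id] = {node_name}
```

In this sketch a second reader obtains the page from the first reader's cache rather than from storage, and a write leaves the writer as the only node recorded as holding the page.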
Embodiments of a directory placement scheme and ways to take advantage of data access locality are described below in Section 1.1. Section 1.2 introduces a messaging dialog, and Section 1.3 demonstrates how locking is utilized at the directory to manage concurrent page accesses according to one embodiment. Lastly, Section 1.4 describes how location-aware page sharing is used to improve performance for geographically distributed cache coherence according to one embodiment.
1.1 Directory Placement
According to one aspect, in order to coordinate cache coherence, the DMG maintains a directory with entries tracking information about every active page. Active pages are those that can be found in at least one cache in the associated share group. The directory entry tracks which nodes have copies of the associated page, maintains a distributed lock to protect concurrent accesses to the page, and maintains a queue to serialize operations while the nodes wait for the lock.
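As a non-limiting sketch, a directory entry of the kind described above might be represented as follows. All field and method names are assumptions introduced for illustration; the specification does not prescribe a particular layout.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Sharing state for one active page (field names are illustrative)."""
    page_id: int
    sharers: set = field(default_factory=set)        # nodes holding a cached copy
    active_readers: int = 0                          # readers currently holding the lock
    writer_active: bool = False                      # True while a write holds the lock
    wait_queue: deque = field(default_factory=deque) # operations waiting for the lock


class Directory:
    """Collection of entries, one per active page."""
    def __init__(self):
        self.entries = {}

    def record_copy(self, page_id, node_id):
        """Note that node_id now caches page_id, creating the entry if needed."""
        entry = self.entries.setdefault(page_id, DirectoryEntry(page_id))
        entry.sharers.add(node_id)
        return entry

    def record_eviction(self, page_id, node_id):
        """Note that node_id dropped page_id; drop the entry once no cache holds it."""
        entry = self.entries.get(page_id)
        if entry is None:
            return
        entry.sharers.discard(node_id)
        if not entry.sharers and not entry.wait_queue:
            del self.entries[page_id]

    def is_active(self, page_id):
        return page_id in self.entries
```

Removing an entry once its sharer set is empty reflects the definition of an active page as one found in at least one cache of the share group.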
1.1.1 Dedicated Directory Server
According to one embodiment, a very simple directory placement scheme is used in which the entire directory is situated on the first node that joins a share group. Such a directory scheme, however, may not scale well with the share group size. In general, there are two competing interests involved in determining the performance of a directory placement scheme. DMG operations on a node containing the directory entry for a given page can be optimized to avoid unnecessary message sends, thereby preserving inter-node bandwidth and improving operation latency. However, the directory node for a page has to process all the relevant message traffic from other nodes in the share group.
1.1.2 Striped Directory Placement
According to another embodiment, directory entries are striped in a round-robin fashion across the nodes in each share group. Because the DMG often has to deal with multi-page operations, the stripe size should be made sufficiently large to avoid frequent splitting of operations. Such striping is easy to implement and has minimal performance overhead, but two problems may make it desirable to seek a better solution. Those two problems are as follows:
According to one embodiment of the present invention, the DMG implements a Locality Conscious Directory Migration, or LCDM, adaptive directory placement scheme. As in the striped directory placement embodiment, in the LCDM embodiment, directory entries are split up into chunks of multiple pages. However, the placement, or ownership, of those chunks is not preset or permanent. Instead, ownership of a directory chunk migrates to the node that is using its pages most frequently. In addition to taking advantage of data access locality for improved I/O performance, LCDM advantageously helps in quickly recovering from node failure (or nodes being added to the share group), since there is no need to redistribute directory chunks. In one embodiment, directory chunk ownership is granted to the first node that accesses a page in the chunk.
In one embodiment, ownership changes hands only when a directory chunk is empty, after all of the nodes have evicted every page in the chunk from their caches. One node in the share group is designated as the global directory chunk coordinator (“global coordinator”). This node grants ownership of a directory entry, and stores a look-up table, or other similar data structure, identifying ownership of each active page in the share group. The other nodes in the share group maintain local partial mirrors of the look-up table, but occasionally need to defer to the global coordinator. A global coordinator may itself also be or become an owner of a directory entry.
This will cause the requester to re-query the global coordinator. The first node to query the global coordinator after the relinquishment will be granted ownership of the chunk.
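For illustration, the first-requester grant and the re-query after relinquishment can be sketched as below. The chunk size, the class names, and the `on_stale_owner` hook are assumptions; in particular, the sketch assumes that a requester learns its mirrored owner is stale from that node's reply, a detail not taken from the specification.

```python
PAGES_PER_CHUNK = 64  # assumed chunk size, not specified in the text


def chunk_of(page_id):
    """Map a page to the directory chunk that contains it."""
    return page_id // PAGES_PER_CHUNK


class GlobalCoordinator:
    """Holds the authoritative chunk-ownership table."""
    def __init__(self):
        self.owner_of = {}  # chunk_id -> owning node id

    def query_owner(self, chunk_id, requester):
        """Return the owner, granting ownership to the requester if unowned."""
        return self.owner_of.setdefault(chunk_id, requester)

    def relinquish(self, chunk_id, node_id):
        """Called by an owner whose chunk has become empty."""
        if self.owner_of.get(chunk_id) == node_id:
            del self.owner_of[chunk_id]


class AccessNode:
    def __init__(self, node_id, coordinator):
        self.node_id = node_id
        self.coordinator = coordinator
        self.mirror = {}  # local partial mirror of the ownership table

    def owner_for_page(self, page_id):
        chunk_id = chunk_of(page_id)
        if chunk_id not in self.mirror:  # defer to the global coordinator
            self.mirror[chunk_id] = self.coordinator.query_owner(chunk_id, self.node_id)
        return self.mirror[chunk_id]

    def on_stale_owner(self, page_id):
        """Drop the stale mirror entry and re-query the coordinator."""
        self.mirror.pop(chunk_of(page_id), None)
        return self.owner_for_page(page_id)
```

Because each node consults its local mirror first, the global coordinator is contacted only on a first access to a chunk or after a mirrored entry proves stale.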
In another embodiment, LCDM is enhanced by allowing ownership of directory chunks to migrate between nodes even when they are not empty. This allows for more closely following I/O access patterns and improves the performance of the DMG by minimizing messaging requirements. However, this optimization generally has certain requirements as follows:
According to one embodiment, a message dialog is provided in the DMG for data access requests, e.g., read, write and update requests. Sequence dialogs are used (see
The message sequences when a write request comes down to the DMG from a requesting node are shown in
As with reads, a number of messaging scenarios are possible in the DMG in response to an update (i.e., a sub-page write) request.
According to one aspect, to avoid unnecessary message latency, when a node cache evicts a page it simply sends out an asynchronous EVICT_REQ message to the DMG to notify it of the eviction. Because the notification is asynchronous, the cache may receive requests for the page after evicting it, but they are handled by the I/O message scenarios detailed above.
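A minimal sketch of the asynchronous eviction notice follows. Only the message name EVICT_REQ comes from the text; the inbox queue, the class names, and the choice to answer a late share request with `None` are assumptions made for illustration.

```python
from collections import deque


class DirectoryNode:
    def __init__(self):
        self.sharers = {}     # page_id -> set of node ids believed to hold a copy
        self.inbox = deque()  # messages received but not yet processed

    def process_inbox(self):
        while self.inbox:
            kind, page_id, node_id = self.inbox.popleft()
            if kind == "EVICT_REQ":
                self.sharers.get(page_id, set()).discard(node_id)


class CacheNode:
    def __init__(self, node_id, directory):
        self.node_id = node_id
        self.directory = directory
        self.cache = {}

    def evict(self, page_id):
        """Drop the page and notify the directory without waiting for a reply."""
        self.cache.pop(page_id, None)
        self.directory.inbox.append(("EVICT_REQ", page_id, self.node_id))

    def handle_share_request(self, page_id):
        """Return the page, or None if it was already evicted (assumed reply)."""
        return self.cache.get(page_id)
```

Until the directory processes the notice, it may still list the evicting node as a sharer, which is why that node must tolerate requests for a page it no longer holds.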
1.3 DMG Locking
One of the DMG's responsibilities is to prevent multiple concurrent changes to page data. The DMG allows multiple readers to access a page at the same time. Multiple writers, however, are serialized in the order in which the write requests arrived at the directory node, e.g., placed in a FIFO queue or buffer. According to one aspect, the caches on the nodes of the system may also require some serialization or locking to reserve and manipulate page frames. Because the nodes may need to hold both cache and DMG locks for the same pages at the same time, the system preferably includes a deadlock handling mechanism. According to one embodiment, the cache locks are made subordinate to the DMG locks, which means a locally held cache lock for a given page must be released if a DMG operation on that page is received.
When a read request arrives at the directory node, if the directory entry indicates that there are no operations waiting for the page and zero or more readers currently active, the read is allowed to proceed immediately. In contrast, when a write or update request arrives at the directory node, it can only proceed if there is no activity whatsoever on the page. In any other situation, new requests to the directory are queued up in the directory entry until they are activated by the completion of a preceding operation. For multi-page operations, the DMG gathers all of the locks for an operation in increasing order before replying to the client node and allowing the associated operation to proceed. If the DMG were to ignore these constraints and respond out-of-order to the client node for segments of a multi-page operation, a deadlock could occur.
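The admission rules above can be sketched as a per-page lock with a FIFO wait queue. The names `PageLock`, `request`, `release`, and `lock_order` are hypothetical, and the sketch models a single directory node without messaging.

```python
from collections import deque


class PageLock:
    """Directory-side lock for one page: many readers or one writer."""
    def __init__(self):
        self.readers = 0
        self.writer = False
        self.waiting = deque()  # FIFO of (op_id, kind)

    def _can_run(self, kind):
        if kind == "read":
            return not self.writer
        return not self.writer and self.readers == 0  # "write" or "update"

    def _grant(self, kind):
        if kind == "read":
            self.readers += 1
        else:
            self.writer = True

    def request(self, op_id, kind):
        """Return True if the operation may proceed now; otherwise queue it."""
        if not self.waiting and self._can_run(kind):
            self._grant(kind)
            return True
        self.waiting.append((op_id, kind))
        return False

    def release(self, kind):
        """Release a held lock and return the op_ids activated as a result."""
        if kind == "read":
            self.readers -= 1
        else:
            self.writer = False
        activated = []
        while self.waiting and self._can_run(self.waiting[0][1]):
            op_id, next_kind = self.waiting.popleft()
            self._grant(next_kind)
            activated.append(op_id)
        return activated


def lock_order(page_ids):
    """Order in which a multi-page operation gathers its locks."""
    return sorted(set(page_ids))
```

Acquiring locks in increasing page order means two multi-page operations can never each hold a lock the other is waiting on, which is the deadlock the paragraph above warns about.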
According to one aspect, one optimization in the DMG to avoid wasted effort in the face of multiple concurrent writes to the same page involves checking the lock wait queue before allowing a write to proceed. If the write at the head of the queue is immediately followed by one or more writes from any node, the DMG can preemptively invalidate the earlier writes and allow the last write to proceed immediately. Since concurrent writes are not expected to be a common case, the implementation of this optimization is not crucial.
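One possible form of this optimization is sketched below. The function name and the `(op_id, kind)` queue representation are assumptions, and the sketch coalesces only full-page writes, leaving sub-page updates in the queue, which is an interpretive choice rather than a statement from the text.

```python
from collections import deque


def coalesce_head_writes(wait_queue):
    """Pop the run of consecutive writes at the head of the wait queue.

    Returns (winner, superseded): the last write in the run, which may
    proceed, and the earlier writes, which are invalidated. Returns
    (None, []) and leaves the queue untouched if the head is not a write.
    """
    run = []
    while wait_queue and wait_queue[0][1] == "write":
        run.append(wait_queue.popleft())
    if not run:
        return None, []
    return run[-1], run[:-1]
```

For a queue holding writes w1, w2, w3 followed by a read, the sketch selects w3 to proceed, reports w1 and w2 as superseded, and leaves the read at the head of the queue.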
1.4 Location Awareness
According to one embodiment, the DMG optimizes its page sharing strategy based on the physical or logical proximity of nodes within a share group. In one aspect, the site id of each node is recorded by all of the nodes in each share group. When the directory node receives a read request, it looks first to see if any node within the same site (or close physical proximity) as the reader has the page. If so, it sends a share request to the nearest node. In another aspect, as a second option, sharing is done by the directory node itself (assuming it has the page in cache), to save the share request message send. If the directory node doesn't have the requested page cached, then any other node can be selected. In this case, no consideration is given to the relative distance between sites (i.e., nodes are either near or far in relation to each other). This same algorithm also applies to page sharing in an update situation.
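The selection order described above can be sketched as a single function. The name `choose_sharer`, the `site_of` mapping, and the use of sorted order to break ties deterministically are assumptions; a return value of `None` is used here to mean that no cached copy exists.

```python
def choose_sharer(reader, directory_node, holders, site_of):
    """Pick the node that should share a page with the reader.

    reader, directory_node: node ids
    holders: set of node ids that have the page cached
    site_of: mapping of node id -> site id
    """
    candidates = sorted(n for n in holders if n != reader)
    # First choice: a holder at the same site as the reader.
    for node in candidates:
        if site_of[node] == site_of[reader]:
            return node
    # Second choice: the directory node itself, saving a share request message.
    if directory_node in candidates:
        return directory_node
    # Otherwise any holder will do; distance between far sites is not ranked.
    return candidates[0] if candidates else None
```

With nodes A and B at one site and C and D at another, a read from A is served by B when B holds the page, and otherwise by the directory node if it holds the page.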
It should be appreciated that code including instructions for implementing aspects of the DMG, including LCDM, can be stored on a computer readable medium such as a CD, DVD, ROM, RAM or the like, or can be transmitted over a network connection to and from data access node devices.
While the invention has been described by way of example and in terms of the specific embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
This application claims the benefit of U.S. provisional application No. 60/586,364, filed Jul. 7, 2004, which is hereby incorporated by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
5197146 | LaFetra | Mar 1993 | A |
5410697 | Baird et al. | Apr 1995 | A |
5577204 | Brewer et al. | Nov 1996 | A |
5611070 | Heidelberger et al. | Mar 1997 | A |
5630097 | Orbits et al. | May 1997 | A |
5727150 | Laudon et al. | Mar 1998 | A |
5832534 | Singh et al. | Nov 1998 | A |
5835957 | Lin | Nov 1998 | A |
5875456 | Stallmo et al. | Feb 1999 | A |
5900015 | Herger et al. | May 1999 | A |
6044438 | Olnowich | Mar 2000 | A |
6049851 | Bryg et al. | Apr 2000 | A |
6081833 | Okamoto et al. | Jun 2000 | A |
6112286 | Schimmel et al. | Aug 2000 | A |
6148414 | Brown et al. | Nov 2000 | A |
6170044 | McLaughlin et al. | Jan 2001 | B1 |
6192408 | Vahalia et al. | Feb 2001 | B1 |
6247144 | Macias-Garza et al. | Jun 2001 | B1 |
6263402 | Ronstrom et al. | Jul 2001 | B1 |
6275953 | Vahalia et al. | Aug 2001 | B1 |
6286090 | Steely, Jr. et al. | Sep 2001 | B1 |
6295584 | DeSota et al. | Sep 2001 | B1 |
6356983 | Parks | Mar 2002 | B1 |
6490661 | Keller et al. | Dec 2002 | B1 |
6631449 | Borrill | Oct 2003 | B1 |
6681239 | Munroe et al. | Jan 2004 | B1 |
6760756 | Davis et al. | Jul 2004 | B1 |
6766360 | Conway et al. | Jul 2004 | B1 |
6813522 | Schwarm et al. | Nov 2004 | B1 |
6816891 | Vahalia et al. | Nov 2004 | B1 |
6857059 | Karpoff et al. | Feb 2005 | B2 |
6912668 | Brown et al. | Jun 2005 | B1 |
6920485 | Russell | Jul 2005 | B2 |
6961825 | Steely, Jr. et al. | Nov 2005 | B2 |
7010554 | Jiang et al. | Mar 2006 | B2 |
7136969 | Niver et al. | Nov 2006 | B1 |
7194532 | Sazawa et al. | Mar 2007 | B2 |
7240165 | Tierney et al. | Jul 2007 | B2 |
7266706 | Brown et al. | Sep 2007 | B2 |
7373466 | Conway | May 2008 | B1 |
7395374 | Tierney et al. | Jul 2008 | B2 |
7475207 | Bromling et al. | Jan 2009 | B2 |
7478202 | Niver et al. | Jan 2009 | B1 |
20010037406 | Philbrick et al. | Nov 2001 | A1 |
20010049773 | Bhavsar | Dec 2001 | A1 |
20020013889 | Schuster et al. | Jan 2002 | A1 |
20020059499 | Hudson | May 2002 | A1 |
20020138698 | Kalla | Sep 2002 | A1 |
20020166031 | Chen et al. | Nov 2002 | A1 |
20030018739 | Cypher et al. | Jan 2003 | A1 |
20030023702 | Kokku et al. | Jan 2003 | A1 |
20030105829 | Hayward | Jun 2003 | A1 |
20030167420 | Parsons | Sep 2003 | A1 |
20030233423 | Dilley et al. | Dec 2003 | A1 |
20040019891 | Koenen | Jan 2004 | A1 |
20040044744 | Grosner et al. | Mar 2004 | A1 |
20040260768 | Mizuno | Dec 2004 | A1 |
20050160230 | Doren et al. | Jul 2005 | A1 |
20050160232 | Tierney et al. | Jul 2005 | A1 |
Number | Date | Country |
---|---|---|
0871128 | Oct 1998 | EP |
WO 9828685 | Jul 1998 | WO |
Number | Date | Country |
---|---|---|
20060031450 A1 | Feb 2006 | US |
Number | Date | Country |
---|---|---|
60586364 | Jul 2004 | US |