Communication or messaging devices, such as desktop personal computers, mobile computing devices or cellular phones, access or retrieve data from a remote data center that includes one or more nodes or servers. Often, resources or data of the data center must be partitioned across one or more owner nodes or servers of the data center. This is often done via a partitioning manager that uses messages to assign ownership of certain resources to nodes. During the partition grant process, grant messages to the owner nodes can be interrupted and lost. Additionally, depending upon the architecture of the partitioning manager, there may be multiple instances of a single logical resource, creating confusion over the contents of the resource.
The application discloses a generic partitioning manager for partitioning resources across one or more owner nodes. In illustrated embodiments described herein, the partitioning manager interfaces with the one or more owner nodes through an owner library. A lookup node interfaces with the partitioning manager through a lookup library to lookup addresses or locations of partitioned resources. In illustrated embodiments, resources are partitioned via leases, which are granted via the partitioning manager in response to lease request messages from owner libraries. In an illustrated embodiment, the owner nodes request all leases that they are entitled to, allowing the partitioning manager to spread the resources across all owners while taking multiple concerns into account, such as load on the owner nodes. In illustrated embodiments, the lease grant message includes a complete list of the leases for the owner node.
In an illustrated embodiment shown in
In the illustrated embodiment, the one or more client devices 104 communicate with the data center 100 through the one or more lookup nodes 120 via a load balancer 125. The load balancer 125 directs or distributes incoming operations or messages across the nodes 120. Thus, in an illustrated embodiment, the lookup nodes 120 and the owner nodes 122 communicate using a communication layer. Furthermore, in an illustrated embodiment, the owner nodes 122 define a storage layer. In embodiments described herein, the lookup nodes 120 and owner nodes 122 interface through the partitioning and recovery manager 124 for the purpose of partitioning resources and delivering recovery notifications.
In the embodiment illustrated in
As shown, leases are generated via a lease generator component 130 based upon load measurements and other status of the one or more owner nodes 122, such as the liveness status. Data may be described as either soft state data, data that is likely to be lost in the event of a routine failure, such as the crash of an individual computer, or hard state data, data that is unlikely to be lost except in the event of a catastrophic failure, such as the data center 100 being hit by a meteor.
In the illustrated embodiment, the lease generator component 130 generates leases for soft state data or other resources. In an alternative embodiment, the lease generator component 130 generates leases for hard state data or other resources. The leases illustratively include a lease time frame or expiration and a lease version as described in more detail in the application.
Although ownership is allocated via leases in the embodiment described, application is not limited to a lease-based system, and embodiments described herein can be implemented on a non-lease-based system. In another embodiment, the owner nodes 122 request ownership of a lease for a particular resource or bucket and the partitioning and recovery manager 124 assigns ownership based upon the request from the owner node 122.
The lookup node 120 is configured to look up partitioned resources on behalf of requests initiated by the client devices 104. As shown in
As described, the partitioning and recovery manager 124 is not otherwise integrated with the storage layer at the owner nodes 122 or the communication layers between the lookup nodes 120 and the owner nodes 122. By not being integrated with the storage and communication layers, the system achieves its goal of being usable in the context of many different services. To implement some new application, it is only necessary to write new code at the owner node 122 and/or lookup node 120 and to use the API exposed by the lookup and owner libraries 132 and 136. Because it is not integrated with the storage and communication layers, the partitioning and recovery manager 124 communicates with the lookup nodes 120 and owner nodes 122 through communication protocols or calls described herein.
In illustrated embodiments, resources are hashed into buckets for lease generation and management. The hashed resources or objects are represented by SummaryKeys using a SummaryKeyRange or ResourceKeyRange, which stores the low and high endpoints for a hash range. A SummaryKey is a 64-bit entity that corresponds to a hash of a resource address. In one embodiment, the partitioning and recovery manager 124 maps buckets directly to servers or nodes 122.
For example, a first fixed number of buckets are mapped to a first server or node and a second fixed number of buckets are mapped to a second server or node. The split of the hash space into buckets may be constant, with each bucket containing a constant fraction of the hash space, or the split may be a function of the nodes currently in the system. For example, given a list of nodes (e.g. node1, node2, etc.) and a virtual node count for each node, each node is mapped to as many points in the hash output space 140 as its virtual node count, as illustrated in
The consistent hashing state is simply represented using an array of tuples of the form (<serverid or node address>, <virtual server count>, <start offset>). To obtain the range in the output space of each bucket, one simply computes hash(<serverid or node address>:<start offset>+0), hash(<serverid or node address>:<start offset>+1), . . . , for each serverid or node address up to its virtual server count, and then sorts the resulting values. The values between consecutive sorted hash values are the range of each bucket.
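For illustration only, the following C++ sketch shows one way the consistent hashing state described above could be expanded into sorted points in the hash output space, the gaps between which define the bucket ranges. The names NodeEntry, Hash64 and ComputeBucketPoints are hypothetical and do not appear in the application, and std::hash stands in for whatever well-distributed 64-bit hash the system actually uses.

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    // One entry of the consistent hashing state: a node address, its virtual
    // node count, and its start offset.
    struct NodeEntry {
        std::string address;
        uint32_t virtualCount;
        uint32_t startOffset;
    };

    // Stand-in 64-bit hash of an address string; a SummaryKey is likewise a
    // 64-bit hash of a resource address.
    uint64_t Hash64(const std::string& s) {
        return std::hash<std::string>{}(s);
    }

    // Hash every node at each of its virtual points and sort the results; the
    // interval between consecutive sorted points is the range of one bucket,
    // and the node that produced the upper point can be taken as its owner.
    std::vector<std::pair<uint64_t, std::string>> ComputeBucketPoints(
            const std::vector<NodeEntry>& nodes) {
        std::vector<std::pair<uint64_t, std::string>> points;
        for (const NodeEntry& n : nodes) {
            for (uint32_t i = 0; i < n.virtualCount; ++i) {
                // hash(<node address>:<start offset + i>)
                points.emplace_back(
                    Hash64(n.address + ":" + std::to_string(n.startOffset + i)),
                    n.address);
            }
        }
        std::sort(points.begin(), points.end());
        return points;
    }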
In an illustrated data structure the buckets are represented using a binary tree 142 as illustrated in
As previously described, the partitioning and recovery manager 124 communicates with the one or more owner libraries 136 to generate the leases for buckets or resources.
As illustrated in step 140, the owner node 122 will initiate a lease request message. In one embodiment, the lease request message has the following format.
As illustrated in step 142 of
As illustrated in
The lease grant message includes the lists of versioned buckets “leases to grant” and “leases to extend”. As shown, the message format utilizes a compact data structure to provide a single message that includes a complete list of the leases held by the owner node 122. Any lease not mentioned in the message is interpreted as not being held or assigned to the owner node 122.
Since the lease message includes the entire lease state of an owner node 122 in a single message, the lease message is self-describing. Because it is self-describing, there is no need to send incremental lease updates to an owner library 136; the partitioning and recovery manager 124 sends all lease grants for the library in every message. Self-describing lease messages facilitate resource moves for load balancing and the reallocation of ownership when new servers are brought online or servers are removed or crash, avoiding many of the protocol complexities faced by incremental lease updates.
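As an illustration of this self-describing property, the sketch below rebuilds an owner library lease table from a single grant message, dropping any lease the message does not mention. The type names VersionedBucket, LeaseGrantMessage and Lease are assumptions; the actual message layout is not reproduced here.

    #include <cstdint>
    #include <map>
    #include <vector>

    // A versioned bucket pairs a bucket with its lease version.
    struct VersionedBucket {
        uint32_t bucketId;
        uint64_t version;
    };

    // A grant message carries the complete lease state for one owner node:
    // the "leases to grant" and "leases to extend" lists plus an expiration.
    struct LeaseGrantMessage {
        std::vector<VersionedBucket> leasesToGrant;
        std::vector<VersionedBucket> leasesToExtend;
        uint64_t leaseExpiration;
    };

    struct Lease {
        uint64_t version;
        uint64_t expiration;
    };

    // Rebuild the lease table from scratch on every grant message: any bucket
    // not mentioned in the message is treated as no longer held.
    void ApplyLeaseGrant(const LeaseGrantMessage& msg,
                         std::map<uint32_t, Lease>& leaseTable) {
        std::map<uint32_t, Lease> updated;
        for (const VersionedBucket& v : msg.leasesToGrant)
            updated[v.bucketId] = Lease{v.version, msg.leaseExpiration};
        for (const VersionedBucket& v : msg.leasesToExtend)
            updated[v.bucketId] = Lease{v.version, msg.leaseExpiration};
        leaseTable.swap(updated);
    }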
In step 146, the owner library lease table 152 is updated as illustrated in
In the illustrated embodiment shown in
In the illustrated embodiment shown in
A lookup call is initiated as follows.
The lookupHandler supplies the following method for returning the results:
An example interface or lookup method results include:
In the type or method above, ResolveSucceeded indicates that the address was resolved with more location information. AlreadyFullyResolved indicates that there is no further location information that the partitioning and recovery manager 124 can provide. CannotResolveLocally indicates that the caller can try resolution at a different (remote) entity, e.g. the cluster specified is not the current cluster. Failed indicates that the address could not be resolved, e.g. the lookup library could not contact the partitioning and recovery manager 124.
When the lookup node 120 attempts to use the lookup library to further resolve a resource address, it may additionally specify IsRetry, a hint that the caller has recently performed a lookup on the same address and the resulting address turned out to be incorrect. When the lookup is done, the lookup library 132 schedules a lookupHandler on the caller's work queue with the appropriate lookup result about whether the call succeeded, and if so, the new address information.
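A minimal sketch of this lookup path is given below. The result codes mirror those listed above, while the class shape, the cache check and the exact signatures are assumptions made for illustration; in the described system an unresolved address would be forwarded to the partitioning and recovery manager 124 rather than simply reported as Failed.

    #include <functional>
    #include <map>
    #include <string>

    // Possible outcomes of a lookup, as described above.
    enum class LookupResultCode {
        ResolveSucceeded,      // resolved with more location information
        AlreadyFullyResolved,  // no further location information available
        CannotResolveLocally,  // caller may retry at a different (remote) entity
        Failed                 // e.g. the manager could not be contacted
    };

    struct LookupResult {
        LookupResultCode code;
        std::string resolvedAddress;  // set when code == ResolveSucceeded
    };

    // Handler supplied by the caller; the lookup library schedules it on the
    // caller's work queue once the lookup completes.
    using LookupHandler = std::function<void(const LookupResult&)>;

    class LookupLibrarySketch {
    public:
        // isRetry hints that a recent lookup of the same address returned a
        // stale result, so the cached entry is bypassed.
        void Lookup(const std::string& address, bool isRetry,
                    LookupHandler handler) {
            if (!isRetry) {
                auto it = cache_.find(address);
                if (it != cache_.end()) {
                    handler(LookupResult{LookupResultCode::ResolveSucceeded,
                                         it->second});
                    return;
                }
            }
            // A real lookup library would now consult the partitioning and
            // recovery manager; this sketch simply reports failure.
            handler(LookupResult{LookupResultCode::Failed, ""});
        }

    private:
        std::map<std::string, std::string> cache_;  // lookup library cache
    };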
In an illustrated embodiment, the data center or system 100 uses a hierarchical resource address methodology. The methodology includes ResourceAddress, GlobalResourceAddress, ClusterLevelResourceAddress and NodeLevelResourceAddress. The ResourceAddress is an abstract class to denote all resources of the system.
The GlobalResourceAddress is a location independent address that essentially corresponds to a generic resource name. The ClusterLevelResourceAddress is a location dependent address specifying the cluster 110 but not the particular owner node 122 within the cluster 110. The NodeLevelResourceAddress is a location dependent address specifying the particular owner node 122.
If a client device 104 passes in a GlobalResourceAddress, the lookup library 132 attempts to resolve it to a ClusterLevelResourceAddress. If the client device 104 passes in a ClusterLevelResourceAddress, the lookup library 132 verifies whether the cluster is correct and, if so, attempts to resolve it to a NodeLevelResourceAddress. If the client device 104 passes in a NodeLevelResourceAddress, the library does not further resolve the address. Although three hierarchical levels are present in the illustrated embodiment, application is not limited to the illustrated embodiments shown.
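The one-step resolution rule described above can be sketched as follows. The flat ResourceAddress structure and the helper NeedsFurtherResolution are illustrative assumptions rather than the classes of the application, which models the address levels as a class hierarchy with ResourceAddress as the abstract base.

    #include <string>

    enum class AddressLevel { Global, ClusterLevel, NodeLevel };

    // Flattened stand-in for the ResourceAddress hierarchy described above.
    struct ResourceAddress {
        AddressLevel level;
        std::string resourceName;  // location-independent resource name
        std::string cluster;       // set for ClusterLevel and NodeLevel
        std::string node;          // set for NodeLevel only
    };

    // Should the lookup library try to resolve this address one more level?
    bool NeedsFurtherResolution(const ResourceAddress& addr,
                                const std::string& currentCluster) {
        switch (addr.level) {
            case AddressLevel::Global:
                return true;   // resolve to a cluster-level address
            case AddressLevel::ClusterLevel:
                // Resolve to a node-level address only if this is the right
                // cluster; otherwise the caller must retry at that cluster.
                return addr.cluster == currentCluster;
            case AddressLevel::NodeLevel:
                return false;  // already fully resolved
        }
        return false;
    }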
In the illustrated embodiment, in response to a lookup call, the lookup library checks the lookup library cache 160 as illustrated in
The lastPRMTime is the time that the lookup library received in a previous LookupResponse message from the partitioning and recovery manager 124.
As shown in
The list of crashed buckets is constructed from a partitioning and recovery manager bucket crash table 164 illustrated in
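One plausible arrangement, sketched below, is for the manager to keep the time at which each bucket crash was recorded and to return only the crashes recorded after the lastPRMTime reported by the lookup library; this filtering rule is an assumption based on the description above, as are the type and function names.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Bucket crash table: for each crashed bucket, the time the crash was
    // recorded by the partitioning and recovery manager.
    using BucketCrashTable = std::map<uint32_t, uint64_t>;

    // Build the crashed-bucket list for a LookupResponse, returning only
    // crashes the lookup library cannot already know about, i.e. those
    // recorded after the lastPRMTime it supplied.
    std::vector<uint32_t> CrashedBucketsSince(const BucketCrashTable& crashTable,
                                              uint64_t lastPRMTime) {
        std::vector<uint32_t> crashed;
        for (const auto& entry : crashTable) {
            if (entry.second > lastPRMTime)
                crashed.push_back(entry.first);
        }
        return crashed;
    }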
As illustrated in
In an illustrated embodiment, once an application or device 104 receives the lookup result or address, it can contact or communicate with the owner node 122 directly, without communicating through the partitioning and recovery manager 124, to retrieve the resource data.
In the illustrated embodiment shown in
The recoveryHandler supplies the following method for returning the results to the lookup nodes via a recovery notification callback, as illustrated by line 172:
As previously described, the lookup node may not automatically learn that a resource or bucket is lost. In illustrated embodiments, the recovery notification registration function provides a call or notification as illustrated by line 172 in
Recovery notification calls at the lookup library are initiated for crashed buckets after the lookup library 132 has learned about crashed buckets recorded in the partitioning and recovery manager crash table 164 shown in
In one embodiment, the partitioning and recovery manager 124 is configured to periodically interface with the lookup libraries 132 to update the lookup library 132 and library cache with data relating to the partitioned resources. In particular, in one embodiment, the partitioning and recovery manager 124 initiates communication with the lookup libraries 132 to notify the lookup libraries 132 of crashed resources or buckets. As illustrated in
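A minimal sketch of the registration and callback path follows; the RecoveryHandler signature and the notifier shape are assumptions chosen to mirror the terms used above.

    #include <cstdint>
    #include <functional>
    #include <utility>
    #include <vector>

    // Called when the lookup library learns that a bucket, and therefore the
    // soft state hashed into it, was lost in a crash.
    using RecoveryHandler = std::function<void(uint32_t crashedBucket)>;

    class RecoveryNotifierSketch {
    public:
        // Corresponds to the recovery notification registration described above.
        void RegisterForRecoveryNotifications(RecoveryHandler handler) {
            handlers_.push_back(std::move(handler));
        }

        // Invoked when a message from the partitioning and recovery manager
        // lists buckets recorded in its crash table.
        void OnCrashedBuckets(const std::vector<uint32_t>& crashedBuckets) {
            for (uint32_t bucket : crashedBuckets)
                for (const RecoveryHandler& handler : handlers_)
                    handler(bucket);
        }

    private:
        std::vector<RecoveryHandler> handlers_;
    };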
The owner library 136 is configured to store partitioned resources or buckets. The owner node 122 interfaces with the partitioning and recovery manager 124 through the owner library 136 as previously illustrated in
Illustratively, the interface or method for obtaining and validating ownership from the owner library 136 is implemented with the following call:
An owner node checks its ongoing ownership of a resource with the following call:
bool CheckContinuousOwnership (OwnershipHandle handle).
The CheckContinuousOwnership function is configured to ascertain whether the owner node currently owns a resource or bucket and whether it has continuously owned the resource or bucket since it was first acquired. The function or method uses an ownership handle to return the results.
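A sketch of the continuous-ownership check follows. The application shows only the call signature, so the contents of OwnershipHandle and the comparison of lease versions against the owner library lease table are assumptions used to illustrate the idea of ownership that has not been interrupted since it was first acquired.

    #include <cstdint>
    #include <map>
    #include <utility>

    struct OwnershipHandle {
        uint32_t bucketId;
        uint64_t acquiredVersion;  // lease version when ownership was acquired
    };

    struct OwnedLease {
        uint64_t version;
        uint64_t expiration;  // lease time frame or expiration
    };

    class OwnerLibrarySketch {
    public:
        explicit OwnerLibrarySketch(std::map<uint32_t, OwnedLease> leaseTable)
            : leaseTable_(std::move(leaseTable)) {}

        // Mirrors bool CheckContinuousOwnership(OwnershipHandle handle): true
        // only if the bucket is currently leased to this node, the lease has
        // not expired, and the lease version is unchanged since the handle was
        // issued, i.e. ownership has been held continuously.
        bool CheckContinuousOwnership(const OwnershipHandle& handle,
                                      uint64_t now) const {
            auto it = leaseTable_.find(handle.bucketId);
            if (it == leaseTable_.end()) return false;            // not held
            if (it->second.expiration <= now) return false;       // expired
            return it->second.version == handle.acquiredVersion;  // uninterrupted
        }

    private:
        std::map<uint32_t, OwnedLease> leaseTable_;  // owner library lease table
    };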
Resources are moved from one owner node to another in response to resource move messages from the partitioning and recovery manager 124.
In step 196, the first owner library is notified via calls from the first owner node that the move is complete. In step 198, the first owner library generates a move result message and a lease request message to the partitioning and recovery manager 124. The partitioning and recovery manager 124 sends a lease grant message to the new or second owner in step 199. In an illustrated embodiment, move tables are generated to keep track of the success or failure of a move function.
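For illustration, the move-related messages and the move table might be sketched as below; all type names and fields are assumptions based on the steps just described.

    #include <cstdint>
    #include <map>
    #include <string>

    // Sent by the partitioning and recovery manager to the current owner node
    // to begin moving a bucket to a new owner.
    struct ResourceMoveMessage {
        uint32_t bucketId;
        std::string newOwnerAddress;
    };

    // Returned by the first owner library once its node reports that the move
    // is complete; the manager then issues a lease grant to the new owner.
    struct MoveResultMessage {
        uint32_t bucketId;
        bool succeeded;
    };

    enum class MoveState { InProgress, Succeeded, Failed };

    // Move table keeping track of the success or failure of each move.
    using MoveTable = std::map<uint32_t, MoveState>;

    void RecordMoveResult(MoveTable& table, const MoveResultMessage& msg) {
        table[msg.bucketId] = msg.succeeded ? MoveState::Succeeded
                                            : MoveState::Failed;
    }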
Illustrated embodiments of the data management system have applications for managing information and services for various functions of a communication network, for example, for publish-subscribe services, queue services, device connectivity services, account services, authorization services, storage services, general notification services, and other services of a communication system or network, although application is not limited to these illustrated embodiments.
Embodiments and methods disclosed herein can be utilized to manage data across multiple clusters (e.g. inter-cluster partitioning) or across data centers. In particular, the lookup nodes 120 can be in different clusters or data centers 100 than the partitioning and recovery manager 124. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter of the application is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as illustrative examples.
This application claims priority to Provisional application, Ser. No. 60/998,647 filed Oct. 12, 2007 and entitled “A LEASE MANAGEMENT SYSTEM HAVING APPLICATION FOR PARTITIONING SOFT STATE DATA”.