A failover process switches an application to a redundant or standby server computing device (“server”), hardware component, or computer network, typically upon unavailability of the previously active server, hardware component, or network. A server can become unavailable due to a failure, abnormal termination, or planned termination, e.g., for performing maintenance work. The failover process can be performed automatically, e.g., without human intervention, and/or manually. Failover processes can be designed to provide high reliability and high availability of data and/or services. Some failover processes back up or replicate data to off-site locations, which can be used in case the infrastructure at the primary location fails. Although the data is backed up to off-site locations, the applications to access that data may not be made available, e.g., because the failover processes may not failover the application. Accordingly, the users of the application may have to experience a downtime—a period during which the application is not available to the users. Such failover processes can provide high reliability but may not be able to provide high availability.
Some failover processes failover both the application and the data. However, the current failover processes are inefficient, as they may not provide both high reliability and high availability. For example, a current failover process can failover the application to a standby server and serve user requests from the standby server. However, the current failover processes may not ensure that the data is replicated entirely from the primary system to the standby system. For example, the network for replicating the data may be overloaded, and the data may not be replicated entirely or may be replicated slowly. When a user issues a data access request, e.g., for obtaining some data, the standby system may not be able to obtain the data, thereby causing the user to experience data loss. That is, such a failover process may provide high availability but not high reliability.
Further, the current failover processes can be even more inefficient in cases where the failover has to be performed from a set of servers located in a first region to a set of servers located in a second region. The regions can be different geographical locations that are far apart, e.g., such that the latency between the systems of different regions is significant. For example, while it can take a millisecond to determine if a server within a specified region has failed, it can take a few hundred milliseconds to determine, from the specified region, if a server in another region has failed. Current failover processes may not be able to detect the failures across regions reliably, and therefore, if the application has to be failed over from the first region to the second region, the second region may not yet be prepared to host the application.
Embodiments are disclosed for a failover mechanism to fail over an application service, e.g., a messenger service in a social networking application, executing on a first set of server computing devices (“servers”) in a first region to a second set of servers in a second region. The failover mechanism supports both planned failover and unplanned failover of the application service. The failover mechanism can failover the application service while still providing high availability of the application service with minimum data loss. Further, in a planned failover process, the failover mechanism can failover the application service to the second region without any data loss and without disrupting the availability of the application service to users of the application service.
The application service can be implemented at a number of server computing devices (“servers”). The servers can be distributed across a number of regions, e.g., geographical regions such as continents, countries, etc. Each region can have a number of the servers and an associated data storage system (“storage system”) in which the application service can store data. The application service can store data, e.g., user data, as multiple shards in which each shard contains data associated with a subset of the users. A shard can be stored at multiple regions in which one region is designated as a primary region and one or more regions are designated as secondary regions for the shard. A primary region for a specified shard can be a region that is assigned to process and/or serve all data access requests from users associated with the specified shard. For example, data access requests from users associated with the specified shard are served by the servers in the primary region for the specified shard. The secondary region can store a replica of the specified shard, and can also be used as a new primary region for the specified shard in an event the current primary region for the specified shard is unavailable, e.g., due to a failure.
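The shard-to-region assignment described above can be sketched as a small lookup structure. The shard IDs, region names, and function name below are illustrative assumptions, not identifiers from the embodiments.

```python
# Hypothetical sketch of shard-region assignments: each shard has one
# primary region and one or more secondary regions holding replicas.
shard_regions = {
    "S1": {"primary": "region-1", "secondaries": ["region-2", "region-3"]},
    "S2": {"primary": "region-2", "secondaries": ["region-1", "region-3"]},
}

def serving_region(shard_id):
    """Return the region assigned to serve data access requests for a shard."""
    return shard_regions[shard_id]["primary"]
```

Under this sketch, all data access requests for shard "S1" are routed to "region-1" until a failover re-designates the primary.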
When a data access request, e.g., a message, is received from a user, the message is processed by a server in the primary region for the specified shard with which the user is associated, replicated to the storage system in the secondary region for the shard, and stored at the storage system in the primary region. A global shard manager computing device (“global shard manager”) can manage failing over the application service from a first region, e.g., the current primary region for a specified shard, to a second region, e.g., one of the secondary regions for the specified shard. As a result of the failover, the second region can become the new primary region for the specified shard, and the first region, if still available, can become the secondary region for the specified shard.
The failover can be a planned failover or an unplanned failover. In the event of the planned failover, the global shard manager can trigger the failover process by designating one of the secondary regions, e.g., the second region, as the expected primary region for the specified shard. In some embodiments, shard assignments to servers within a region can be managed using a regional shard manager computing device (“regional shard manager”). A first regional shard manager associated with the current primary region, e.g., the first region, determines whether one or more criteria for failing over the application service are satisfied. For example, the first regional shard manager determines whether there is a replication lag between the current primary region and the expected primary region. If there is no replication lag, e.g., the storage system of the expected primary region has all of the data associated with the specified shard that is stored at the current primary region, the first regional shard manager requests the global shard manager to promote the expected primary region, e.g., the second region, as the new primary region for the specified shard, and to demote the current primary region, e.g., the first region, to the secondary region for the specified shard. Any necessary services and processes for serving data access requests from the users associated with the specified shard are started at the servers in the second region. Any data access requests from the users associated with the specified shard are now forwarded to the servers in the second region, as the second region is the new primary region for the specified shard.
Referring back to determining whether there is a replication lag, if there is a replication lag, then the first regional shard manager can determine whether the replication lag is within a specified threshold, e.g., whether the storage system of the expected primary region has most of the data associated with the specified shard stored at the current primary region. If the replication lag is within the specified threshold, the first regional shard manager can wait until there is no replication lag. While the first regional shard manager is waiting for the replication lag to become zero, e.g., all data associated with the specified shard is copied to the storage system at the second region, the first regional shard manager can block any incoming data access requests to the first region from the users associated with the specified shard so that the replication lag does not increase. After the replication lag becomes zero, the first regional shard manager can instruct the global shard manager to promote the expected primary region, e.g., the second region, as the primary region and demote the current primary region, e.g., the first region, to being the secondary region for the specified shard. After the second region is promoted to the primary region, the first region can also forward the blocked data access requests to the second region. Referring back to determining whether the replication lag is below the specified threshold, if the first regional shard manager determines that the replication lag is above the specified threshold, it can indicate to the global shard manager that the failover process may not be initiated.
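The planned-failover decision in the preceding two paragraphs reduces to a three-way branch on the replication lag. The sketch below is a simplified model; the action names and the notion of lag as a single number are assumptions for illustration.

```python
def plan_failover_step(replication_lag, threshold):
    """Decide the next planned-failover action from the replication lag.

    Returns one of three illustrative action names:
      "promote"        - no lag; promote the expected primary immediately
      "block_and_wait" - lag within threshold; block incoming requests
                         and wait for the lag to reach zero
      "abort"          - lag above threshold; do not initiate the failover
    """
    if replication_lag == 0:
        return "promote"
    if replication_lag <= threshold:
        return "block_and_wait"
    return "abort"
```

Blocking incoming requests in the middle case is what prevents the lag from growing while the remaining data drains to the expected primary region.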
In the event of the unplanned failover, e.g., due to servers failing in the primary region, the global shard manager instructs one of the secondary regions of the specified shard, e.g., the second region, to become the new primary region and fails over the application service to the new primary region. If the replication lag of the new primary region is above the specified threshold, the application service can be unavailable to the users associated with the specified shard up until the replication lag is below the threshold or is zero. In some embodiments, the application service can be made immediately available to the users regardless of the replication lag, however, the users may experience a data loss in such a scenario.
Turning now to the figures,
Also illustrated in the example 200 is assignment of shards to application servers within a region. A regional shard manager 210 can manage the assignment of shards to the application servers. In some embodiments, the assignments are input by the administrator. In some embodiments, the regional shard manager 210 can determine the shard assignments based on shard-server assignment policies provided by the administrator. The regional shard manager 210 can store the shard-server assignments in a shard-server assignment table 235. In the example 200, the first shard “S1,” is assigned to an application server “A11.” This mapping can indicate that data access requests from users associated with shard “S1” are processed by the application server “A11.” In some embodiments, each of the regions can have a regional shard manager, such as the regional shard manager 210.
In some embodiments, each of the regions can be a different geographical region, e.g., a country, continent. Typically, a response time for accessing data at the storage system within a particular region is less than that for accessing data from a storage system in a region different from that of the application server. In some embodiments, two systems are considered to be in different regions if the latency between them is beyond a specified threshold.
As described at least with reference to
A data access request from a user is served by a specified region and a specified application server in the specified region based on a specified shard with which the user is associated. When a user issues a data access request 310 from a client computing device (“client”) 305, a routing computing device (“routing device”) 315 determines a shard with which the user is associated. In some embodiments, the routing device 315, the global shard manager 205 and/or another service (not illustrated) can have information regarding the mapping of the users to the shards, e.g., user identification (ID) to shard ID mapping. For example, the user is associated with the first shard 151. After determining the shard ID, the routing device 315 can determine the primary region for the first shard 151 using the global shard manager 205. For example, the routing device 315 determines that the first region 350 is designated as the primary region for the first shard 151. The routing device 315 then contacts the regional shard manager of the primary region, e.g., the first regional shard manager 327, to determine the application server to which the data access request is to be routed. For example, the first regional shard manager 327 indicates, e.g., based on the shard-server assignment table 235, that the data access requests for the first shard 151 are to be served by the application server “A11” in the first region 350. The routing device 315 sends the data access request 310 to the application server “A11” accordingly.
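The routing flow above (user ID to shard ID, shard ID to primary region, then region plus shard to application server) might be sketched as three table lookups. The table contents and identifiers below are illustrative, not from the embodiments.

```python
# Hypothetical routing tables; in the description these lookups are served
# by the routing device, the global shard manager, and a regional shard
# manager, respectively.
user_to_shard = {"user-42": "S1"}                 # user ID -> shard ID
shard_to_primary_region = {"S1": "region-1"}      # shard-region assignments
shard_to_server = {("region-1", "S1"): "A11"}     # shard-server assignments

def route(user_id):
    """Resolve the (region, application server) pair for a user's request."""
    shard = user_to_shard[user_id]
    region = shard_to_primary_region[shard]
    server = shard_to_server[(region, shard)]
    return region, server
```

A request from "user-42" would thus be routed to application server "A11" in "region-1", matching the "A11" example in the description.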
The application server “A11” processes the data access request 310. For example, if the data access request 310 is a request for sending data, e.g., a message to another user, the application server “A11” sends the message to the other user and also sends the message to the first data server 370 for storing the message at the first storage system 365. In some embodiments, the first data server 370 also replicates the data received from the data access request 310 to the secondary regions, e.g., the second region 325 and the third region 375, of the first shard 151. The data servers at the respective secondary regions receive the data and store the received data at their corresponding storage systems.
In some embodiments, the application service 110 can be failed over from a first region 350 to another region for various reasons, e.g., for performing maintenance work on the application servers 360, or the application servers 360 become unavailable due to a failure. The failover can be a planned failover or an unplanned failover. The global shard manager 205 and the regional shard managers, e.g., the regional shard managers of the primary region and one of the secondary regions that is expected to be the new primary region can coordinate with each other to perform the failover process for a specified shard. As a result of the failover process, the current primary region of the specified shard is demoted to be a secondary region for the specified shard and one of the current secondary regions is promoted to be the new primary region for the specified shard. In some embodiments, the global shard manager 205 determines the secondary region that has to be promoted as the new primary region for the specified shard. In some embodiments, the failover process is performed per shard. However, the failover process can be performed for multiple shards in parallel or in sequence.
Consider that the application service 110 has to be failed over for the first shard “S1” 151 from the first region 350 to the second region 325. The first region 350 is the current primary region (e.g., primary region prior to the failover process) of the first shard 151. The second region 325 and the third region 375 are the current secondary regions for the first shard 151, and the second region 325 is to be the new primary region for the first shard 151 (as a result of the failover process).
The global shard manager 205 can trigger the failover process by updating a value of an expected primary region of the first shard 151, e.g., to the second region 325. In some embodiments, a request receiving component (not illustrated) in the global shard manager 205 can receive the request for failing over from the administrator. The administrator can update the expected primary region attribute value using the request receiving component. Upon a change in value of the expected primary region of the first shard 151, the regional shard manager of the current primary region for the first shard 151, e.g., the first regional shard manager 327 of the first region 350, determines whether one or more criteria for failing over the application service 110 are satisfied. In some embodiments, the regional shard managers have a criteria determination component (not illustrated) that can be used to determine whether the one or more criteria are satisfied for performing the failover. The administrator may also input the criteria using the criteria determination component. For example, a replication lag of data between the second storage system 345 of the expected primary region and the first storage system 365 of the current primary region can be one of the criteria.
If there is no replication lag, e.g., the second storage system 345 of the expected primary region has all of the data associated with the first shard 151 that is stored at the current primary region, the first regional shard manager 327 requests the global shard manager 205 to promote the expected primary region, e.g., the second region 325, as the new primary region for the first shard 151. The first regional shard manager 327 can also demote the current primary region, e.g., the first region 350, to being the secondary region for the first shard 151. Any necessary services and processes for serving data access requests from the users associated with the first shard 151 are started at the second set of application servers 330 in the new primary region, e.g., the second region 325. Any data access requests from the users associated with the first shard 151 are now forwarded to the second set of application servers 330 in the second region 325. The global shard manager 205 also instructs the second regional shard manager 326 to update information indicating that the second region 325 is the primary region for the first shard 151.
Referring back to determining whether there is a replication lag, if there is a replication lag, then the first regional shard manager 327 can determine whether the replication lag is within a specified threshold. If the replication lag is within the specified threshold, the first regional shard manager 327 can wait until there is no replication lag. In some embodiments, data replication between regions can be performed via the data servers of the corresponding regions. While the first regional shard manager 327 is waiting for the replication lag to become zero, e.g., all data associated with the first shard 151 is copied to the second storage system 345 at the second region 325, the first regional shard manager 327 can block any incoming data access requests to the first region 350 from the users associated with the first shard 151 so that the replication lag does not increase.
Once the replication lag becomes zero, the first regional shard manager 327 can instruct the global shard manager 205 to promote the expected primary region, e.g., the second region 325, as the new primary region and demote the current primary region, e.g., the first region 350, to being the secondary region. After the second region 325 is promoted to the primary region for the first shard 151, the first region 350 can also forward any blocked data access requests to the second region 325. Referring back to determining whether the replication lag is below the specified threshold, if the first regional shard manager 327 determines that the replication lag is above the specified threshold, it can indicate to the global shard manager 205 that the failover process may not be initiated, e.g., as a significant amount of data may be lost if the application service is failed over.
As a result of the failover process, the global shard manager 205 can update the shard-region assignments, e.g., in the shard-region assignment table 225 to indicate the second region 325 is the primary region for the first shard 151. Similarly, the first regional shard manager 327 can update the shard-server assignments, e.g., in the shard-server assignment table 235.
In the event of the unplanned failover, e.g., due to application servers failing in the first region 350, the global shard manager 205 instructs the second region 325 to become the new primary region for the first shard 151 and fails over the application service 110 to the new primary region regardless of the replication lag between the first region 350 and the second region 325. If the replication lag of the new primary region is above the specified threshold, the application service 110 can be unavailable to the users associated with the first shard 151, e.g., up until the replication lag is below the threshold or is zero. In some embodiments, the application service can be made immediately available to the users regardless of the replication lag, however, the users may experience data loss in such a scenario.
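The availability trade-off for an unplanned failover described above can be sketched as a small predicate; the parameter names and the modeling of lag as a single number are assumptions for illustration.

```python
def available_after_unplanned_failover(replication_lag, threshold,
                                       allow_data_loss=False):
    """Whether the new primary region can serve requests immediately after
    an unplanned failover.

    If data loss is tolerated, the application service is made available
    regardless of the replication lag; otherwise it remains unavailable
    until the lag is below the specified threshold.
    """
    if allow_data_loss:
        return True  # immediately available; users may observe data loss
    return replication_lag < threshold
```

This captures the choice in the description: immediate availability with possible data loss, or delayed availability with the lag drained first.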
In some embodiments, the first region 350 may become unavailable, e.g., due to a failure such as power failure, and therefore may not be used as the secondary region for the first shard 151, or any shard. The global shard manager may choose any other region, in addition to the third region 375, as the secondary region for the first shard 151. However, if available, the first region 350 can act as the secondary region for the first shard 151.
If the replication gap is large, the prepare demote process may not be completed, and it may indicate to the global shard manager 205 that the failover process cannot be completed as the replication gap is above the specified threshold. In some embodiments, the prepare demote process can wait until the replication gap is below the specified threshold, and once the replication gap is below the specified threshold, a “gap small” process is initiated. The gap small process blocks any incoming data access requests (block 520) at the primary region 505 so that the replication lag does not increase further. After the replication lag is zero, a “gap closed” process is initiated. The gap closed process can perform the final demotion of the primary region 505 to the secondary region 510, and the “promote” process of the secondary region 510 can promote the secondary region 510 to being the primary region 505.
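The prepare demote, gap small, and gap closed stages above form a small state machine driven by the replication gap. The sketch below assumes a single numeric gap and string state names taken from the description; it is a model, not the embodiments' implementation.

```python
def demote_transition(state, replication_gap, threshold):
    """Advance the demotion state machine by one step.

    States follow the description: "prepare_demote" waits for the gap to
    fall below the threshold, "gap_small" blocks incoming requests and
    waits for the gap to reach zero, and "gap_closed" performs the final
    demotion (modeled here as the terminal state "demoted").
    """
    if state == "prepare_demote":
        return "gap_small" if replication_gap <= threshold else "prepare_demote"
    if state == "gap_small":  # incoming data access requests are blocked here
        return "gap_closed" if replication_gap == 0 else "gap_small"
    if state == "gap_closed":
        return "demoted"
    return state
```

Driving this function in a loop as the gap shrinks traces the same progression the description walks through for the primary region 505.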
At block 620, the routing device 315 determines a primary region for the specified shard. For example, the routing device 315 requests the global shard manager 205 to determine the primary region for the specified shard. The global shard manager 205 can use the shard-region assignment table 225 to determine the primary region for the specified shard.
After the primary region is identified, at block 625, the routing device 315 determines the application server in the primary region that is assigned to serve the data access requests for the specified shard. For example, the routing device 315 requests the first regional shard manager 327 in the primary region to determine the application server for the specified shard. The first regional shard manager 327 can use the shard-server assignment table 235 to determine the application server that serves the data access requests for the specified shard.
At block 630, the routing device 315 can send the data access request to the specified application server in the primary region.
At block 715, the regional shard manager of the current primary region of the specified shard, e.g., the first regional shard manager 327 of the first region 350, confirms that one or more criteria for failing over the application service 110 to the second region 325 are satisfied.
At block 720, the first regional shard manager 327 instructs the regional shard manager of the expected primary region of the specified shard, e.g., the second regional shard manager 326 of the second region 325, to promote the second region 325 to the primary region for the specified shard. The first regional shard manager 327 can demote the first region 350 to the secondary region for the specified shard.
At block 815, if the replication lag is below the specified threshold, the first regional shard manager 327 blocks any incoming data access requests for the specified shard at the first region 350, e.g., in order to keep the replication lag from increasing. If the replication lag is not below the specified threshold, the first regional shard manager 327 can indicate to the global shard manager 205 that the failover process 800 cannot be continued since the replication lag is beyond the specified threshold, and the process 800 can return.
At block 820, the first regional shard manager 327 determines if the replication lag is zero, e.g., all data associated with the first shard 151 in the first storage system 365 is copied to the second storage system 345 at the second region 325.
If the replication lag is not zero, the process 800 waits until the replication lag is zero, and continues to block any incoming data access requests for the specified shard at the first region 350. If the replication lag is zero, the second regional shard manager 326 can indicate to the global shard manager 205 its preparedness for promoting the second region 325 to the primary region. The process 800 can then continue with the process described at least with reference to block 720 of
At block 915, the second regional shard manager 326 also maps the specified shard to an application server in the second region 325. For example, the second regional shard manager 326 can update a shard-server assignment table associated with the second region 325, such as the shard-server assignment table 235, to indicate that the first shard is mapped to the application server “A21.”
At block 920, the first regional shard manager 327 can stop replicating data from the first region 350 to the second region 325. For example, the first regional shard manager 327 can instruct the first data server 370 to stop replicating the data associated with the first shard 151 to the second region 325.
At block 925, the second regional shard manager 326 can start replicating data associated with the specified shard from the second region 325 to the secondary regions of the specified shard. For example, the second data server 340 can replicate the data associated with the first shard 151 to the first region 350 and the third region 375.
At block 930, the first regional shard manager 327 forwards any blocked data access requests, e.g., that were blocked as part of the process described at least with reference to block 815 of
At block 935, the first regional shard manager 327 forwards any new data access requests received at the first region 350 that are associated with the specified shard to the second region 325.
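The post-promotion steps of blocks 915 through 935 can be sketched as a single sequence. Region state is modeled below as plain dicts purely for illustration; the keys and method of modeling are assumptions, while the shard ID "S1" and server ID "A21" follow the example in the description.

```python
def finalize_failover(old_primary, new_primary, blocked_requests):
    """Sketch of the post-promotion steps for one shard.

    Maps the shard to a server in the new primary region (block 915),
    stops replication from the old primary (block 920), starts replication
    from the new primary to the secondaries (block 925), and returns the
    blocked requests to be forwarded to the new primary (blocks 930-935).
    """
    new_primary["shard_server"] = {"S1": "A21"}               # block 915
    old_primary["replicating_to"] = []                        # block 920
    new_primary["replicating_to"] = ["region-1", "region-3"]  # block 925
    # Requests blocked while the lag drained are forwarded first; new
    # requests arriving at the old primary follow the same path.
    return list(blocked_requests)
```

After this sequence, the former primary acts as a secondary and merely forwards traffic for the shard.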
The memory 1010 and storage devices 1020 are computer-readable storage media that may store instructions that implement at least portions of the described embodiments. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer readable media can include computer-readable storage media (e.g., “non-transitory” media).
The instructions stored in memory 1010 can be implemented as software and/or firmware to program the processor(s) 1005 to carry out actions described above. In some embodiments, such software or firmware may be initially provided to the processing system 1000 by downloading it from a remote system (e.g., via the network adapter 1030).
The embodiments introduced herein can be implemented by, for example, programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, or entirely in special-purpose hardwired (non-programmable) circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.
The above description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the disclosure. However, in some instances, well-known details are not described in order to avoid obscuring the description. Further, various modifications may be made without deviating from the scope of the embodiments. Accordingly, the embodiments are not limited except as by the appended claims.
Reference in this specification to “one embodiment” or “an embodiment” means that a specified feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. For convenience, some terms may be highlighted, for example using italics and/or quotation marks. The use of highlighting has no influence on the scope and meaning of a term; the scope and meaning of a term is the same, in the same context, whether or not it is highlighted. It will be appreciated that the same thing can be said in more than one way. One will recognize that “memory” is one form of a “storage” and that the terms may on occasion be used interchangeably.
Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein; no special significance is to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for some terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any term discussed herein, is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.
Those skilled in the art will appreciate that the logic illustrated in each of the flow diagrams discussed above may be altered in various ways. For example, the order of the logic may be rearranged, substeps may be performed in parallel, illustrated logic may be omitted, other logic may be included, etc.
Without intent to further limit the scope of the disclosure, examples of instruments, apparatus, methods and their related results according to the embodiments of the present disclosure are given below. Note that titles or subtitles may be used in the examples for convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions will control.