Field of the Invention
This invention generally relates to database management systems and more specifically to detecting node failures during the operation of a distributed database system.
Description of Related Art
The above-identified U.S. Pat. No. 8,224,860 discloses a distributed database management system comprising a network of transactional nodes and archival nodes. Archival nodes act as storage managers for all the data in the database. Each user connects to a transactional node to perform operations on the database by generating queries that are processed at that transactional node. A given transactional node need only contain the data and metadata required to process queries from users connected to that node. This distributed database is defined by an array of atom classes, such as an index class, and atoms, where each atom corresponds to a different instance of the class, such as an index atom for a specific index. Replications, or copies, of a single atom may reside in multiple nodes, and the atom copy in a given node is processed in that node.
In an implementation of such a distributed database, asynchronous messages are transferred among the different nodes to maintain the database in a consistent and concurrent state. Specifically, each node in the database network has a unique communication path to every other node. When one node generates a message involving a specific atom, that message may be sent to every node that contains a replication of that specific atom. Each node generates these messages independently of other nodes, so at any given instant multiple nodes will contain copies of a given atom, and different nodes may be at various stages of processing that atom. As the operations in different nodes are not synchronized, it is important that the database be in a consistent and concurrent state at all times.
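By way of illustration only, the following minimal sketch (in Python, with hypothetical names such as Node, send and replica_map that do not appear in the patent) shows one way a node might asynchronously send an atom-related message to every node holding a replication of that atom:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Node:
    """Illustrative node that tracks, per atom, which peers hold a replica."""
    node_id: str
    replica_map: dict = field(default_factory=dict)   # atom_id -> set of peer node ids

    async def send(self, peer_id: str, message: dict) -> None:
        # Stand-in for the real transport (e.g., a TCP connection to the peer).
        print(f"{self.node_id} -> {peer_id}: {message}")

    async def broadcast_atom_message(self, atom_id: str, payload: dict) -> None:
        # Each node sends independently and asynchronously to every node
        # that contains a replication of the affected atom.
        peers = self.replica_map.get(atom_id, set())
        message = {"atom": atom_id, "from": self.node_id, **payload}
        await asyncio.gather(*(self.send(p, message) for p in peers))

# Usage: node A tells every holder of atom "index:42" about a change.
node = Node("A", {"index:42": {"B", "C"}})
asyncio.run(node.broadcast_atom_message("index:42", {"op": "insert", "key": 7}))
```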
A major characteristic of such distributed databases is that all nodes be in communication with each other at all times so that the database is completely connected. If a communications break occurs, the database is no longer considered to be connected. One or more nodes must be identified and may then be removed from the network in an orderly manner. Such identification and removal must account for the fact that any node can fail at any given time, that a communications break can occur between only two nodes, or that multiple breaks can occur among several nodes. The identification of a node or nodes for removal must be accomplished in a reliable manner. Moreover, such identification should enable failure processes to resolve a failure with minimal interruption to users.
Therefore it is an object of this invention to provide a method for detecting a node failure in a distributed database management system.
Another object of this invention is to provide a method for detecting a node failure and for designating a node for failure.
Still another object of this invention is to provide a method for detecting a node failure and for designating a node for failure on a reliable basis.
Yet another object of this invention is to provide a method for detecting a node failure and for designating a node for failure with minimal interruption to users.
The appended claims particularly point out and distinctly claim the subject matter of this invention. The various objects, advantages and novel features of this invention will be more fully apparent from a reading of the following detailed description in conjunction with the accompanying drawings in which like reference numerals refer to like parts, and in which:
Each node in
The database request engine 41 exists only on transactional nodes and is the interface between high-level input and output commands at the user level and input and output commands at the system level. In general terms, the database request engine parses, compiles and optimizes user queries, such as SQL queries, into commands that are interpreted by the various classes or objects in the set 42. The classes/objects set 42 is divided into a subset 43 of “atom classes,” a subset 44 of “message classes” and a subset 45 of “helper classes.”
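As a purely illustrative sketch (the class names and dispatch table below are assumptions, not the actual atom classes of the set 42), the compiled output of the database request engine can be thought of as low-level commands routed to the class instances that interpret them:

```python
# Hypothetical dispatch from a parsed, compiled operation to the object that
# interprets it; the real class set 42/43/44/45 is not reproduced here.
class IndexAtom:
    def apply(self, op: dict) -> None:
        print(f"index atom applying {op}")

class RecordAtom:
    def apply(self, op: dict) -> None:
        print(f"record atom applying {op}")

ATOM_HANDLERS = {"index": IndexAtom(), "record": RecordAtom()}

def execute(parsed_op: dict) -> None:
    # The request engine routes each low-level command to the class/object
    # responsible for that fragment of the database.
    ATOM_HANDLERS[parsed_op["target"]].apply(parsed_op)

execute({"target": "index", "op": "insert", "key": 7})
```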
Referring to
The atom classes collectively define atoms that contain fragments of the database including metadata and data. Atoms are replicated from one node to another so that users at different nodes are dealing with consistent and concurrent versions of the database, even during actions taken to modify the information in any atom. At any given time there is no guarantee that all replicated copies of a single atom will be in the same state because there is no guarantee that messages from one node to other nodes will be processed concurrently.
As previously indicated, communications between any two nodes are by way of serialized messages which are transmitted asynchronously using TCP or another protocol with controls to maintain message sequences.
The failure to receive a ping acknowledgement within a predetermined time indicates a break in communications with respect to messages being sent between the requesting, or sending, node and the recipient, or receiving, node. In the context of this invention, a first node transmits the Ping message 110 to another node and functions as an “informer node,” or I-Node, if the Ping Acknowledge signal is not received. If there is a failure, the I-Node identifies the receiving node as being suspicious (i.e., an “S-Node”) by means of a Suspicious Node message 159. A Leader Acknowledge message 160 triggers a request for each I-Node to respond with information about any suspicious nodes that connect to that specific I-Node.
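A minimal sketch, assuming an asynchronous transport object with hypothetical send and recv_ack methods and an arbitrary timeout value, of the check an I-Node might perform before marking a peer suspicious and informing the Leader Node:

```python
import asyncio

PING_TIMEOUT = 2.0  # stands in for the "predetermined time"; the value is an assumption

async def check_peer(transport, i_node_id, peer_id, leader_id):
    """Return True if the peer acknowledged the ping, False if it is marked suspicious."""
    await transport.send(peer_id, {"type": "PING", "from": i_node_id})
    try:
        # Wait for the Ping Acknowledge from the receiving node.
        await asyncio.wait_for(transport.recv_ack(peer_id), timeout=PING_TIMEOUT)
        return True
    except asyncio.TimeoutError:
        # No acknowledgement: report the receiving node as an S-Node to the Leader Node
        # with a Suspicious Node message carrying both node identifications.
        await transport.send(leader_id, {"type": "SUSPICIOUS_NODE",
                                         "i_node": i_node_id,
                                         "s_node": peer_id})
        return False
```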
A purpose of this invention is to detect a node failure and enable corrective action.
Referring to
If no Ping Acknowledge message is received within a defined time interval, it is assumed that a communications break exists. Step 215 marks that receiving node as being “suspicious” with respect to the sending I-Node. Step 216 sends a Suspicious Node message 159, which includes the I-Node identification and the Suspicious Node identification, to a Leader Node.
A Leader Node is responsible for analyzing the information received from all the I-Nodes. Only one L-Node can exist at a time, and it must be a non-suspicious node. A node can act as an L-Node only if it has received a response for a given set of S-Nodes from a majority of the database as represented by the other I-Nodes (i.e., a majority of the active, non-suspicious nodes). If the active I-Node receives a Leader Acknowledge message in a timely manner, step 217 uses step 214 to select the next node to be tested by the I-Node and control returns to step 212. Otherwise, there is no certainty as to whether the I-Node or the node being tested is causing the communications break. Step 220 selects the next non-suspicious node as the new Leader Node. If such a node is available, step 221 returns control to step 216 and the message is resent to the new L-Node. If no node qualifies as an L-Node, an error state is generated in step 222.
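Building on the check_peer sketch above, the I-Node side of steps 212 through 222 might be arranged as follows (the timeout, the transport interface and the recv_leader_ack helper are assumptions, not the patented implementation):

```python
import asyncio

LEADER_ACK_TIMEOUT = 2.0  # assumed waiting time for step 217

async def i_node_loop(transport, i_node_id, peers, leader_candidates):
    suspicious = set()
    candidates = iter(leader_candidates)
    leader = next(candidates)
    for peer in peers:                                            # step 214: select next node
        if await check_peer(transport, i_node_id, peer, leader):  # steps 212, 215, 216
            continue
        suspicious.add(peer)
        while True:
            try:
                # Step 217: wait for the Leader Acknowledge message.
                await asyncio.wait_for(transport.recv_leader_ack(leader),
                                       timeout=LEADER_ACK_TIMEOUT)
                break
            except asyncio.TimeoutError:
                # Step 220: select the next non-suspicious node as the new Leader Node.
                leader = next((c for c in candidates if c not in suspicious), None)
                if leader is None:
                    raise RuntimeError("error state: no node qualifies as an L-Node")  # step 222
                # Step 221: resend the Suspicious Node message to the new L-Node (step 216).
                await transport.send(leader, {"type": "SUSPICIOUS_NODE",
                                              "i_node": i_node_id,
                                              "s_node": peer})
    return suspicious
```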
With respect to the process shown in
If a majority does not exist at the instant of step 253, step 254 transfers control to step 257, which times out the first time interval. If the majority is reached prior to the expiration of that time interval, step 257 diverts control to steps 255 and 256. Otherwise step 257 transfers control to step 260, which sends a Leader Acknowledge message to all I-Nodes and then waits in step 261 for a second time interval to determine whether a majority of I-Nodes respond. At the end of that interval, control passes through step 262 to step 255 if a majority of I-Nodes has been identified. If the second time interval expires without obtaining a majority, step 262 diverts control to step 263 to establish an error state.
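On the leader side, steps 253 through 263 reduce to a majority check over two successive waiting intervals, sketched below (the interval lengths, the queue of incoming reports, and the message format are assumptions made for illustration):

```python
import asyncio

FIRST_INTERVAL = 2.0    # assumed length of the first time interval (step 257)
SECOND_INTERVAL = 2.0   # assumed length of the second time interval (step 261)

async def leader_confirms_majority(transport, active_nodes, reports):
    """reports: an asyncio.Queue of I-Node ids reporting a given set of S-Nodes."""
    majority = len(active_nodes) // 2 + 1
    reporters = set()

    async def gather_reports():
        while len(reporters) < majority:
            reporters.add(await reports.get())

    async def wait_for_majority(interval):
        try:
            await asyncio.wait_for(gather_reports(), timeout=interval)
        except asyncio.TimeoutError:
            pass
        return len(reporters) >= majority

    # Steps 253/254/257: wait out the first interval for a majority of I-Nodes.
    if await wait_for_majority(FIRST_INTERVAL):
        return True                                   # proceed to steps 255 and 256
    # Step 260: send a Leader Acknowledge message to all I-Nodes, ...
    for node in active_nodes:
        await transport.send(node, {"type": "LEADER_ACKNOWLEDGE"})
    # ... then step 261: wait a second interval for a majority of I-Nodes to respond.
    if await wait_for_majority(SECOND_INTERVAL):
        return True                                   # step 262 passes control to step 255
    raise RuntimeError("error state: no majority of I-Nodes responded")   # step 263
```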
After either step 275 or step 276 completes its process, control returns to step 271 to execute the process for another I-Node. When all the I-Nodes have been processed, the node designations are made available to the failure system 200 in
Therefore there has been disclosed an embodiment of this invention wherein each node operating with a distributed database monitors its communication paths with all other nodes in the network. Any communications break is noted, and the network is analyzed to determine which nodes need to be failed. This information is reported for processing, whereby failures are handled in an orderly fashion with minimal interruption to user activities and in a manner in which data remains consistent and concurrent. More specifically, this invention enhances user access because it detects a node failure in an orderly and efficient manner, enabling an appropriate failure system to maintain the database in a concurrent and consistent state.
It will be apparent that many modifications can be made to the disclosed apparatus without departing from the invention. For example, this invention has been described with a “majority” defined as a majority of a subset of all the active nodes. In other applications, the “majority” might be defined as a majority of the archival nodes. Still other subsets of nodes could be defined as the basis for determining a majority. Specific messages and processes have been given different titles for purposes of explanation. It will be apparent that equivalent messages and processes could be designed by different names while still performing the same or equivalent functions. Therefore, it is the intent of the appended claims to cover all such variations and modifications as come within the true spirit and scope of this invention.
U.S. Pat. No. 8,224,860 granted Jul. 17, 2012 for a Database Management System and assigned to the same assignee as this invention is incorporated in its entirety herein by reference. This application claims priority from U.S. Provisional Application Ser. No. 61/789,370 filed Mar. 15, 2013 for Node Failure Detection for a Distributed Database, which application is incorporated in its entirety by reference.
Number | Name | Date | Kind |
---|---|---|---|
5446887 | Berkowitz | Aug 1995 | A |
5524240 | Barbara et al. | Jun 1996 | A |
5555404 | Torbjørnsen et al. | Sep 1996 | A |
5701467 | Freeston | Dec 1997 | A |
5764877 | Lomet et al. | Jun 1998 | A |
5960194 | Choy et al. | Sep 1999 | A |
6216151 | Antoun | Apr 2001 | B1 |
6275863 | Leff et al. | Aug 2001 | B1 |
6334125 | Johnson et al. | Dec 2001 | B1 |
6401096 | Zellweger | Jun 2002 | B1 |
6424967 | Johnson et al. | Jul 2002 | B1 |
6480857 | Chandler | Nov 2002 | B1 |
6499036 | Gurevich | Dec 2002 | B1 |
6523036 | Hickman et al. | Feb 2003 | B1 |
6748394 | Shah et al. | Jun 2004 | B2 |
6862589 | Grant | Mar 2005 | B2 |
7028043 | Bleizeffer et al. | Apr 2006 | B2 |
7080083 | Kim et al. | Jul 2006 | B2 |
7096216 | Anonsen | Aug 2006 | B2 |
7219102 | Zhou et al. | May 2007 | B2 |
7233960 | Boris et al. | Jun 2007 | B1 |
7293039 | Deshmukh et al. | Nov 2007 | B1 |
7395352 | Lam et al. | Jul 2008 | B1 |
7401094 | Kesler | Jul 2008 | B1 |
7403948 | Ghoneimy et al. | Jul 2008 | B2 |
7562102 | Sumner et al. | Jul 2009 | B1 |
7853624 | Friedlander et al. | Dec 2010 | B2 |
7890508 | Gerber et al. | Feb 2011 | B2 |
8108343 | Wang et al. | Jan 2012 | B2 |
8224860 | Starkey | Jul 2012 | B2 |
8266122 | Newcombe et al. | Sep 2012 | B1 |
8504523 | Starkey | Aug 2013 | B2 |
8756237 | Stillerman et al. | Jun 2014 | B2 |
20020112054 | Hatanaka | Aug 2002 | A1 |
20020152261 | Arkin et al. | Oct 2002 | A1 |
20020152262 | Arkin et al. | Oct 2002 | A1 |
20030051021 | Herschfeld et al. | Mar 2003 | A1 |
20030204486 | Berks et al. | Oct 2003 | A1 |
20030204509 | Dinker | Oct 2003 | A1 |
20030220935 | Vivian et al. | Nov 2003 | A1 |
20030233595 | Charny | Dec 2003 | A1 |
20040153459 | Whitten | Aug 2004 | A1 |
20050086384 | Ernst | Apr 2005 | A1 |
20050216502 | Kaura et al. | Sep 2005 | A1 |
20080320038 | Liege | Dec 2008 | A1 |
20100094802 | Luotojarvi et al. | Apr 2010 | A1 |
20100115338 | Rao | May 2010 | A1 |
20100153349 | Schroth | Jun 2010 | A1 |
20100235606 | Oreland et al. | Sep 2010 | A1 |
20100297565 | Waters et al. | Nov 2010 | A1 |
20110087874 | Timashev et al. | Apr 2011 | A1 |
20110231447 | Starkey | Sep 2011 | A1 |
20120136904 | Venkata Naga Ravi | May 2012 | A1 |
20120254175 | Horowitz et al. | Oct 2012 | A1 |
20130060922 | Koponen et al. | Mar 2013 | A1 |
20130110766 | Promhouse et al. | May 2013 | A1 |
20130159265 | Peh et al. | Jun 2013 | A1 |
20130232378 | Resch et al. | Sep 2013 | A1 |
20130297976 | McMillen | Nov 2013 | A1 |
20130311426 | Erdogan et al. | Nov 2013 | A1 |
20140279881 | Tan et al. | Sep 2014 | A1 |
20140297676 | Bhatia et al. | Oct 2014 | A1 |
20140304306 | Proctor et al. | Oct 2014 | A1 |
Number | Date | Country |
---|---|---|
1403782 | Mar 2004 | EP |
2003-256256 | Sep 2003 | JP |
2006-048507 | Feb 2006 | JP |
2007-058275 | Mar 2007 | JP |
Entry |
---|
U.S. Appl. No. 14/215,401, filed Mar. 17, 2014, Palmer. |
U.S. Appl. No. 14/215,461, filed Mar. 17, 2014, Palmer. |
U.S. Appl. No. 14/616,713, filed Feb. 8, 2015, Levin. |
U.S. Appl. No. 14/688,396, filed Apr. 16, 2015, Shaull. |
U.S. Appl. No. 14/725,916, filed May 29, 2015, Rice. |
U.S. Appl. No. 14/726,200, filed May 29, 2015, Palmer. |
U.S. Appl. No. 14/744,546, filed Jun. 19, 2015, Massari. |
“Album Closing Policy,” Background, retrieved from the Internet at URL:http://tools/wiki/display/ENG/Album+Closing+Policy (Jan. 29, 2015), 4 pp. |
“Distributed Coordination in NuoDB,” YouTube, retrieved from the Internet at URL:https://www.youtube.com/watch?feature=player_embedded&v=URoeHvflVKg on Feb. 4, 2015, 2 pp. |
Bergsten et al., “Overview of Parallel Architectures for Databases,” The Computer Journal vol. 36, No. 8, pp. 734-740 (1993). |
Dan et al., “Performance Comparisons of Buffer Coherency Policies,” Proceedings of the International Conference on Distributed Computer Systems, IEEE Comp. Soc. Press vol. 11, pp. 208-217 (1991). |
Durable Distributed Cache Architecture, retrieved from the Internet at URL: http://www.nuodb.com/explore/newsql-cloud-database-ddc-architecture on Feb. 4, 2015, 3 pp. |
“Glossary—NuoDB 2.1 Documentation/NuoDB,” retrieved from the Internet at URL: http://doc.nuodb.com/display/doc/Glossary on Feb. 4, 2015, 1 pp. |
“How It Works,” retrieved from the Internet at URL: http://www.nuodb.com/explore/newsql-cloud-database-how-it-works?mkt_tok=3RkMMJW on Feb. 4, 2015, 4 pp. |
“How to Eliminate MySQL Performance Issues,” NuoDB Technical Whitepaper, Sep. 10, 2014, Version 1, 11 pp. |
“Hybrid Transaction and Analytical Processing with NuoDB,” NuoDB Technical Whitepaper, Nov. 5, 2014, Version 1, 13 pp. |
International Preliminary Report on Patentability mailed Oct. 13, 2015 from PCT/US2014/033270, 4 pp. |
International Search Report and Written Opinion mailed Aug. 21, 2014 from PCT/US2014/033270, 5 pp. |
International Search Report mailed Sep. 26, 2012 from PCT/US2011/029056, 4 pp. |
Leverenz et al., “Oracle8i Concepts, Partitioned Tables and Indexes,” Chapter 11, pp. 11-2 to 11-66 (1999). |
“No Knobs Administration,” retrieved from the Internet at URL: http://www.nuodb.com/explore/newsql-cloud-database-product/auto-administration on Feb. 4, 2015, 4 pp. |
Non-Final Office Action mailed Jan. 21, 2016 from U.S. Appl. No. 14/215,401, 19 pp. |
Non-Final Office Action mailed Feb. 1, 2016 from U.S. Appl. No. 14/215,461, 19 pp. |
Non-Final Office Action mailed Feb. 6, 2014 from U.S. Appl. No. 13/933,483, 14 pp. |
Non-Final Office Action mailed May 19, 2016 from U.S. Appl. No. 14/247,364, 24 pp. |
Non-Final Office Action mailed Oct. 10, 2012 from U.S. Appl. No. 13/525,953, 8 pp. |
Notice of Allowance mailed Feb. 29, 2012 from U.S. Appl. No. 13/051,750, 8 pp. |
Notice of Allowance mailed Apr. 1, 2013 from U.S. Appl. No. 13/525,953, 10 pp. |
Notice of Allowance mailed May 14, 2012 from U.S. Appl. No. 13/051,750, 8 pp. |
NuoDB at a Glance, retrieved from the Internet at URL: http://doc.nuodb.com/display/doc/NuoDB+at+a+Glance on Feb. 4, 2015, 1 pp. |
Rahimi, S. K. et al., “Distributed Database Management Systems: A Practical Approach,” IEEE Computer Society, John Wiley & Sons, Inc. Publications (2010), 765 pp. |
Shaull, R. et al., “A Modular and Efficient Past State System for Berkeley DB,” Proceedings of USENIX ATC '14:2014 USENIX Annual Technical Conference, 13 pp. (Jun. 19-20, 2014). |
Shaull, R. et al., “Skippy: A New Snapshot Indexing Method for Time Travel in the Storage Manager,” SIGMOD'08, Jun. 9-12, 2008, 12 pp. |
Shaull, R., “Retro: A Methodology for Retrospection Everywhere,” A Dissertation Presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University, Waltham, Massachusetts, Aug. 2013, 174 pp. |
“SnapShot Albums,” Transaction Ordering, retrieved from the Internet at URL:http://tools/wiki/display/ENG/Snapshot+Albums (Aug. 12, 2014), 4 pp. |
“Table Partitioning and Storage Groups (TPSG),” Architect's Overview, NuoDB Technical Design Document, Version 2.0 (2014), 12 pp. |
“The Architecture & Motivation for NuoDB,” NuoDB Technical Whitepaper, Oct. 5, 2014, Version 1, 27 pp. |
“Welcome to NuoDB Swifts Release 2.1 GA,” retrieved from the Internet at URL: http://dev.nuodb.com/techblog/welcome-nuodb-swifts-release-21-ga on Feb. 4, 2015, 7 pp. |
“What Is A Distributed Database? And Why Do You Need One,” NuoDB Technical Whitepaper, Jan. 23, 2014, Version 1, 9 pp. |
Yousif, M. “Shared-Storage Clusters,” Cluster Computing, Baltzer Science Publishers, Bussum, NL, vol. 2, No. 4, pp. 249-257 (1999). |
Final Office Action mailed Sep. 9, 2016 from U.S. Appl. No. 14/215,461, 26 pp. |
International Search Report and Written Opinion mailed Jul. 15, 2016 from PCT/US2016/27658, 37 pp. |
International Search Report and Written Opinion mailed Sep. 8, 2016 from PCT/US16/37977, 11 pp. |
International Search Report and Written Opinion mailed Sep. 9, 2016 from PCT/US16/34646, 12 pp. |
Non-Final Office Action mailed Sep. 23, 2016 from U.S. Appl. No. 14/616,713, 8 pp. |
Number | Date | Country | |
---|---|---|---|
61789370 | Mar 2013 | US |