The present invention generally relates to ensuring data consistency across nodes of a synchronous replication cluster.
MySQL (a trademark of MySQL AB Limited Company or its successors) is a popular open source database management system. Natively, MySQL may be configured to replicate data from a master node to a slave node asynchronously or semi-synchronously, but not synchronously.
In asynchronous data replication, data is replicated from a master node to a slave node independently of when transactions are committed at the master node. Thus, transactions may be committed at the master node without regard to when the transactions will be replicated to the slave node. Asynchronous replication therefore enables transactions to be committed relatively quickly at the master node, but if the master node becomes inoperable, there is no guarantee that a transaction committed at the master node has been replicated to the slave node. In asynchronous replication, the data stored on the slave node may not be current with data stored on the master node. As a result, read operations performed on the slave node may read out-of-date data. Further, if the master node crashes, then the slave node may not have the most recent set of data, resulting in data loss.
In semi-synchronous replication, a transaction is only committed at the master node when the master node receives acknowledgement that the slave node has received a copy of the transaction. Thus, when a transaction is committed at the master node, there is a guarantee that the slave node has at least received the transaction.
In synchronous replication, a transaction is only committed at the master node when the master node receives acknowledgement that the slave node has committed the transaction. Thus, when a transaction is committed at the master node, there is a guarantee that the slave node has also committed the transaction. Synchronous replication therefore requires more time to commit a transaction at the master node than asynchronous replication does; however, if the master node becomes inoperable, there is a guarantee that the state of the database maintained by the slave node is consistent with the state of the database at the master node prior to the master node becoming inoperable.
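As a non-limiting illustration, the three replication modes described above may be contrasted by what the slave node is guaranteed to hold at the moment the master node commits. The following is a minimal in-memory sketch, not MySQL or Galera code; the names `Mode`, `Master`, `Slave`, and `state_after_commit` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    ASYNC = "asynchronous"
    SEMI_SYNC = "semi-synchronous"
    SYNC = "synchronous"


@dataclass
class Slave:
    received: list = field(default_factory=list)   # transactions delivered to the slave
    committed: list = field(default_factory=list)  # transactions the slave has committed

    def receive(self, txn):
        self.received.append(txn)

    def commit(self, txn):
        self.committed.append(txn)


@dataclass
class Master:
    slave: Slave
    mode: Mode
    committed: list = field(default_factory=list)
    backlog: list = field(default_factory=list)  # committed locally, not yet shipped

    def commit(self, txn):
        if self.mode is Mode.ASYNC:
            # Commit without waiting; replication happens independently, later.
            self.backlog.append(txn)
        elif self.mode is Mode.SEMI_SYNC:
            # Commit only after the slave acknowledges receipt.
            self.slave.receive(txn)
        else:
            # Commit only after the slave acknowledges its own commit.
            self.slave.receive(txn)
            self.slave.commit(txn)
        self.committed.append(txn)


def state_after_commit(mode):
    """Return (slave received T1?, slave committed T1?) right after the master commits."""
    slave = Slave()
    master = Master(slave, mode)
    master.commit("T1")
    return ("T1" in slave.received, "T1" in slave.committed)
```

In this sketch the acknowledgements are modeled as direct method calls; in a real deployment they would be network messages, which is the source of the additional commit latency in the synchronous case.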
MySQL may be configured to employ a third party library to provide additional functionality to a MySQL installation. For example, MySQL may be used in conjunction with a third party synchronous replication library, such as Galera. A MySQL server integrated with the Galera library enables a plurality of MySQL servers to interact with each other in a master-slave synchronous replication relationship.
In a Galera-based master-slave MySQL synchronous replication cluster, one MySQL server functions as a master and one or more MySQL servers function as a slave. The MySQL master server can handle both read and write requests while a MySQL slave server can handle only read requests. MySQL clients may only send write transactions to the MySQL master server but may send read transactions to either the MySQL master server or any MySQL slave servers.
A write set is prepared at the master for each MySQL write transaction initiated at the master. A write set is a set of information that may be used to perform the write operations that are specified by the requested MySQL write transaction. The write set is replicated from the master to each slave and is used by each slave to perform the write operations that are specified by the requested write transaction at the slave. Each slave uses write sets to commit the write transaction.
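For purposes of illustration only, a write set may be pictured as a record of the changes a write transaction makes, which the master and each slave apply to their own copy of the database. The sketch below assumes a key-value view of the database; `WriteSet`, `prepare_write_set`, and `apply_write_set` are hypothetical names and do not reflect the actual Galera write set format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    txn_id: int
    changes: tuple  # ((key, new_value), ...), enough to redo the writes elsewhere


def prepare_write_set(txn_id, writes):
    """Build a write set at the master from the writes of one transaction."""
    return WriteSet(txn_id, tuple(writes.items()))


def apply_write_set(database, write_set):
    """Perform the write operations described by the write set on one node."""
    for key, value in write_set.changes:
        database[key] = value


# The master and a slave start from the same state.
master_db = {"A": 1, "B": 2}
slave_db = dict(master_db)

# The same write set is applied at the master and replicated to the slave.
ws = prepare_write_set(1, {"A": 10, "C": 30})
apply_write_set(master_db, ws)
apply_write_set(slave_db, ws)
```

Because both nodes apply the identical write set, their databases end in the same state.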
In a Galera-based master-slave MySQL synchronous replication cluster, write transactions received at a MySQL master server are replicated synchronously to each MySQL slave server. When a MySQL slave server receives a particular read query, to ensure data consistency, the MySQL slave server waits for all transactions, received by the MySQL slave server prior to the particular read query, to be committed prior to processing the particular read query. Unfortunately, if the MySQL master server receives a large volume of write transactions, then read queries at each MySQL slave server perform poorly, because each read query must wait for every earlier write transaction to be committed regardless of whether that transaction affects the data being read.
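The cost of the prior approach may be sketched as follows, assuming pending write transactions are modeled as dictionaries of key-value changes held in a queue. The function `read_prior_approach` is a hypothetical name used only for this example.

```python
from collections import deque


def read_prior_approach(database, pending, keys):
    """Commit every pending write set, related or not, and only then read."""
    applied = 0
    while pending:
        write_set = pending.popleft()
        database.update(write_set)  # commit the write set on the slave
        applied += 1
    return {k: database.get(k) for k in keys}, applied


db = {"A": 1}
pending = deque([{"X": 1}, {"Y": 2}, {"Z": 3}])  # none of these touch A
result, applied = read_prior_approach(db, pending, ["A"])
```

Here the read of `A` is delayed by three write sets even though none of them changes `A`, so the delay grows with the write volume at the master.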
Discussion in this section is meant to provide an understanding of prior approaches to the reader as they relate to embodiments of the invention. However, the disadvantages of prior approaches discussed in this section are not meant to be an admission that such disadvantages were publicly known. Consequently, recognition of the disadvantages of the prior art discussed herein is not meant to be construed as being within the prior art simply by virtue of its inclusion in this section alone.
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
Approaches for ensuring data consistency across nodes of a MySQL synchronous replication cluster are described. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Embodiments of the invention ensure consistency between MySQL databases maintained at different nodes of a MySQL synchronous replication cluster in an efficient manner.
Each MySQL server in MySQL synchronous replication cluster 100, regardless of whether it is a master server or a slave server, may be configured to use a third party synchronous replication library, such as Galera. Using such a third party synchronous replication library, a node of a cluster acting as a master may replicate data synchronously to the other nodes of the cluster that act as slaves. The master node can handle both read and write requests from MySQL clients 140 while slave nodes can handle only read requests from MySQL clients 140.
When MySQL master server 110 receives a write request from a particular MySQL client, MySQL master server 110 performs the requested write transaction against a MySQL database maintained by MySQL master server 110. Thereafter, prior to committing the transaction, MySQL master server 110 replicates the write set for the transaction to each MySQL slave server in the MySQL synchronous replication cluster 100 (for example, MySQL slave servers 120 and 130 in
To illustrate replicating a write set using an exemplary transaction, consider
Embodiments of the invention enable read queries to be processed by MySQL slave servers in less time than in prior approaches.
In step 310, a MySQL client in one or more MySQL clients 140 sends a read query to MySQL slave server 120. MySQL slave server 120 parses the read query and determines the read set for the read query. A read set is the set of objects or values that are requested to be read in the read query. A read set is used to determine if the read query conflicts with one or more write sets in a write-set conflict window. Non-limiting, illustrative examples of the information that a read set would contain include one or more databases that the read query depends on, one or more database tables that the read query depends on, and row and column information that the read query depends on.
A write-set conflict window for a read query is a set of (a) write-sets in the write-set queue waiting to be processed and (b) the write-sets currently being processed when the read query is received.
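By way of a non-limiting sketch, the write-set conflict window may be modeled as a snapshot, taken when the read query arrives, of the write sets being processed together with those still queued. `SlaveState` and `conflict_window` are hypothetical names, and write sets are represented here by simple string identifiers.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlaveState:
    queue: deque = field(default_factory=deque)      # write sets waiting to be processed
    in_progress: list = field(default_factory=list)  # write sets currently being processed

    def conflict_window(self):
        """Snapshot (b) in-progress plus (a) queued write sets at read arrival."""
        return list(self.in_progress) + list(self.queue)


state = SlaveState(queue=deque(["ws2", "ws3"]), in_progress=["ws1"])
window = state.conflict_window()  # taken when the read query is received
state.queue.append("ws4")         # arrives after the read query
```

Because the window is a snapshot, write sets that arrive after the read query (such as `ws4` above) are not part of it.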
A specific example of performing step 310 shall be explained with reference to transaction 2 that corresponds to Read A, B, C. The read set for transaction 2 will correspond to any write operations that affect the values of A, B, or C. The write-set conflict window will thus be any transaction, either currently being processed by MySQL slave server 120 or residing in write-set queue 122 waiting to be processed, that changes the value of A, B, or C.
In step 320, MySQL slave server 120 determines the write set conflict window for the read query.
In step 330, MySQL slave server 120 determines if the read set determined in step 310 conflicts with one or more write sets in the write set conflict window determined in step 320. A read set is considered to conflict with a write set in a write set conflict window if committing the write set would cause the read query to read a different value than it would have read had the read query been processed outside of the write set conflict window.
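One simple way to approximate the conflict test of step 330 is to treat the read set and each write set as sets of keys and to flag a conflict whenever they overlap. This is an assumption made for illustration: key overlap is a conservative test that may report a conflict even when the written value is unchanged. The function name and the contents of the window below are hypothetical.

```python
def conflicting_write_sets(read_set, window):
    """Return ids of write sets in the window that touch any key in the read set.

    `window` is a list of (write_set_id, keys_written) pairs.
    """
    return [ws_id for ws_id, keys in window if read_set & keys]


read_set = {"A", "B", "C"}
window = [
    (1, {"X"}),        # unrelated to the read set
    (2, {"B"}),        # writes B, so it conflicts
    (3, {"Y", "Z"}),   # unrelated to the read set
    (4, {"C", "D"}),   # writes C, so it conflicts
]
conflicts = conflicting_write_sets(read_set, window)
```

Only write sets 2 and 4 need to be committed before the read proceeds; write sets 1 and 3 can be ignored by this read query.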
In the prior example, the read set would conflict with a write set in the write set conflict window if a write set in the write set conflict window updates the value of A, B, or C. As shown in
In step 340, MySQL slave server 120 waits for all the conflicting write sets in the write set conflict window to be committed. Once all conflicting write sets in the write set conflict window are committed by MySQL slave server 120, processing proceeds to step 350.
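The waiting behavior of step 340 may be sketched with a condition variable, assuming each write set has an identifier and that the thread applying write sets announces each commit. `CommitTracker` and `read_after_conflicts` are hypothetical helpers, not part of MySQL or Galera, and the timeout is an added safeguard not described in the text.

```python
import threading


class CommitTracker:
    """Tracks which write sets a slave has committed (hypothetical helper)."""

    def __init__(self):
        self._committed = set()
        self._cond = threading.Condition()

    def mark_committed(self, ws_id):
        with self._cond:
            self._committed.add(ws_id)
            self._cond.notify_all()

    def wait_for(self, ws_ids, timeout=None):
        """Block until every id in ws_ids is committed; return False on timeout."""
        needed = set(ws_ids)
        with self._cond:
            return self._cond.wait_for(lambda: needed <= self._committed, timeout)


def read_after_conflicts(tracker, conflicting_ids, database, keys, timeout=5.0):
    # Wait only for the conflicting write sets, not the whole queue.
    if not tracker.wait_for(conflicting_ids, timeout):
        raise TimeoutError("conflicting write sets were not committed in time")
    return {k: database[k] for k in keys}


tracker = CommitTracker()
db = {"A": 1, "B": 2}


def applier():
    db["B"] = 20               # apply write set 2
    tracker.mark_committed(2)  # then announce the commit


worker = threading.Thread(target=applier)
worker.start()
result = read_after_conflicts(tracker, [2], db, ["A", "B"])
worker.join()
```

If the list of conflicting write sets is empty, `wait_for` returns immediately and the read is served without any delay.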
In step 350, MySQL slave server 120 processes the read query. By performing the steps of
Embodiments of the invention are directed towards ensuring data consistency across nodes of a MySQL synchronous replication cluster. Nodes of the MySQL synchronous replication cluster may be implemented on a wide variety of hardware. For example, nodes of the MySQL synchronous replication cluster may chiefly or wholly employ the use of solid state devices to persistently store data. In an embodiment, the architecture of embodiments is specifically tailored for using solid state devices in a fast, efficient, and scalable manner to obtain better performance than prior approaches. For example, each node of synchronous replication cluster 100 may correspond to a device 100 described in U.S. patent application Ser. No. 12/983,754.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
This application is a continuation-in-part of, and claims priority to, U.S. non-provisional patent application Ser. No. 12/983,754, entitled “Efficient Flash-Memory Based Object Store,” filed on Jan. 3, 2011, invented by John Busch et al., the entire contents of which are incorporated by reference for all purposes as if fully set forth herein. This application is also a continuation-in-part of, and claims priority to, U.S. non-provisional patent application Ser. No. 12/983,758, entitled “Flexible Way of Specifying Storage Attributes in a Flash-Memory Based Object Store,” filed on Jan. 3, 2011, invented by Darryl Ouye et al., the entire contents of which are incorporated by reference for all purposes as if fully set forth herein. This application is also a continuation-in-part of, and claims priority to, U.S. non-provisional patent application Ser. No. 12/983,762, entitled “Minimizing Write Operations to a Flash Memory-Based Object Store,” filed on Jan. 3, 2011, invented by Darpan Dinker, the entire contents of which are incorporated by reference for all purposes as if fully set forth herein.
Number | Date | Country |
---|---|---|
20130151467 A1 | Jun 2013 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 12983754 | Jan 2011 | US |
Child | 13399982 | | US |
Parent | 12983758 | Jan 2011 | US |
Child | 12983754 | | US |
Parent | 12983762 | Jan 2011 | US |
Child | 12983758 | | US |