The following Detailed Description will first present an overview of an improved technique for automatically changing to a different transport mode, will then present two implementations of the improved maximum availability mode, and will finally present details of how the second implementation is implemented in an Oracle 10gR2 database system manufactured by Oracle Corporation.
At the start 203 of the method, the primary database system is already using one of the available transport modes to provide redo to a standby database system. As the primary database system does so, it periodically executes loop 221. On each execution, the primary database system determines whether a redo transport mode would at least potentially constrain the rate at which the primary database system is currently processing transactions. This redo transport mode is termed in the following the measuring redo transport mode. In a preferred embodiment, the measuring redo transport mode is a synchronous transport mode; consequently, whether the measuring redo transport mode would constrain the rate at which the primary database system is currently processing transactions is determined using the current network I/O latency currLAT(x) for the measuring redo transport mode (205). This latency is computed for x bytes of redo data as RTT(x)+IO(x), where RTT(x) is the round trip time to send the x bytes of redo data from the primary database system to the standby database system and receive the confirmation for it in the primary database system, and IO(x) is the time it takes the standby database system to write the x bytes of redo data to the standby's redo log.
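The latency computation at 205 can be illustrated with a minimal sketch. The function name `curr_lat`, its parameter names, and the sample timings are assumptions introduced for illustration only; the description specifies only the sum RTT(x)+IO(x).

```python
def curr_lat(rtt_seconds: float, io_seconds: float) -> float:
    """Return currLAT(x) = RTT(x) + IO(x) for one packet of x bytes of redo.

    rtt_seconds: round trip time to send the redo and receive confirmation.
    io_seconds:  time the standby takes to write the redo to its redo log.
    Both names are hypothetical; they do not come from the description.
    """
    if rtt_seconds < 0 or io_seconds < 0:
        raise ValueError("timings must be non-negative")
    return rtt_seconds + io_seconds


# Illustrative (made-up) measurements: 4 ms round trip, 2 ms standby write.
latency = curr_lat(rtt_seconds=0.004, io_seconds=0.002)
print(f"currLAT = {latency * 1000:.1f} ms")
```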
The primary database system then uses the value of currLAT(x) for the measuring redo transport mode to determine whether changing to a different transport mode for the redo data would be desirable (207). A change to a different transport mode is desirable if the value indicates that:
If currLAT(x) for the measuring redo transport mode indicates that no change in the transport mode is necessary, the loop is again executed after a wait period (209). Otherwise, branch 211 is taken and the primary database system determines whether a transport mode change is possible (213). If it is not, the loop is again executed as before (215). If it is (217), the transport mode is changed to a more desirable transport mode (219). An example of a transport mode change that would be desirable but not possible is a case where the current network latency would permit a change from an asynchronous to a synchronous transport mode but there is no standby database system currently available. In the preferred embodiment, there are only two transport modes and consequently, the method of flowchart 201 selects one or the other of these transport modes based on the current network latency for the SYNC transport. In other embodiments, there may be more than two transport modes.
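One pass through loop 221 can be sketched as follows. This is only an illustration of the control flow at 207 through 219 under stated assumptions: the `Mode` enum, the function `loop_iteration`, and the two callables are hypothetical names, and how desirability and possibility are actually evaluated is left to the caller.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    """The two transport modes of the preferred embodiment (names assumed)."""
    SYNC = "sync"
    ASYNC = "async"


def loop_iteration(
    current: Mode,
    desired_mode: Callable[[Mode], Mode],
    change_possible: Callable[[Mode, Mode], bool],
) -> Mode:
    """Return the transport mode in effect after one execution of the loop."""
    target = desired_mode(current)            # 207: is a change desirable?
    if target is current:
        return current                        # 209: no change, wait and repeat
    if not change_possible(current, target):  # 213: is the change possible?
        return current                        # 215: desirable but not possible
    return target                             # 217, 219: change the mode


# Example from the text: latency would permit SYNC, but no standby is available.
result = loop_iteration(
    Mode.ASYNC,
    desired_mode=lambda mode: Mode.SYNC,
    change_possible=lambda old, new: False,
)
print(result)
```

Keeping the two decisions as separate callables mirrors the flowchart, in which desirability (207) and possibility (213) are distinct tests.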
In the following, two techniques are described for determining whether a transport change is desirable. The first of these makes the determination on the basis of a parameter received from the database administrator which indicates a range of acceptable current network I/O latencies for the measuring transport mode. The second makes the determination on the basis of how much of the measuring transport mode's currently available bandwidth the primary would require at the primary's current rate of generating redo data. As described, the techniques are used with two transport modes; they may, however, be easily adapted to systems with more than two transport modes.
If the current network I/O latency for the measuring transport mode is larger than the maximum acceptable I/O latency (313), indicating that the measuring transport mode would constrain the primary database system, the primary database system determines whether a change to a faster transport mode is possible (309, 331). If it is (335), the change is made (337) and the loop is repeated; if not, the loop is simply repeated (333). If the current network I/O latency for the measuring transport mode is not greater than the maximum acceptable I/O latency (311), the primary database system determines whether a change to a less risky transport mode is possible (321); if it is (325), the change is made (327) and the loop is repeated; if not, the loop is simply repeated (323). In embodiments with more than two kinds of transport modes for redo data, there could be a maximum acceptable I/O latency for each transport mode.
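The threshold test just described can be sketched for the two-mode case. The sketch assumes that the synchronous mode is both the measuring mode and the less risky mode, that the asynchronous mode is the faster one, and that a change to the synchronous mode is possible only when a standby is available; the function and parameter names are hypothetical.

```python
def next_mode_by_latency(
    current: str,
    curr_lat: float,
    max_acceptable_lat: float,
    standby_available: bool = True,
) -> str:
    """Choose "SYNC" or "ASYNC" from the current network I/O latency."""
    if current not in ("SYNC", "ASYNC"):
        raise ValueError(f"unknown transport mode: {current}")
    if curr_lat > max_acceptable_lat:
        # 313: the measuring mode would constrain the primary, so move to the
        # faster mode if the primary is not already using it.
        return "ASYNC" if current == "SYNC" else current
    # 311: latency is acceptable, so move to the less risky mode if possible.
    if current == "ASYNC" and standby_available:
        return "SYNC"
    return current


# Illustrative values only: 12 ms measured against a 10 ms maximum.
print(next_mode_by_latency("SYNC", curr_lat=0.012, max_acceptable_lat=0.010))
```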
The scheme of
Then, whether a change in transport mode is desirable is determined from the value of the expression CRR(x)/MRR(x) (409). The larger this fraction is, the more likely it is that the speed of the measuring transport mode may constrain the primary database system; the smaller it is, the less likely. The decision whether to change the transport mode is made by establishing an upper bound and a lower bound for the value of the fraction. If CRR(x)/MRR(x) is greater than the upper bound, the measuring transport mode is taken to be constraining the primary database system; if it is less than the lower bound, the measuring transport mode is taken to be not constraining the primary database system. Consequently, in a preferred embodiment, if the fraction is above the upper bound, the transport mode should be changed to a faster transport mode if the current transport mode is the measuring transport mode; if it is below the lower bound, the transport mode should be changed to a less risky transport mode if the current transport mode is more risky. The logic for changing transport modes at 413-437 is identical with the logic of
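The two-bound decision on CRR(x)/MRR(x) can be sketched as below. The function name, the mode strings, and the example bounds of 0.5 and 0.8 are assumptions for illustration; the description does not give numeric bounds. The sketch again assumes that SYNC is the measuring, less risky mode and ASYNC the faster one.

```python
def next_mode_by_ratio(
    current: str,
    crr: float,
    mrr: float,
    lower_bound: float,
    upper_bound: float,
) -> str:
    """Choose "SYNC" or "ASYNC" from the fraction CRR(x)/MRR(x)."""
    if mrr <= 0:
        raise ValueError("MRR must be positive")
    if lower_bound > upper_bound:
        raise ValueError("lower bound must not exceed upper bound")
    ratio = crr / mrr
    if ratio > upper_bound and current == "SYNC":
        return "ASYNC"   # measuring mode is taken to be constraining
    if ratio < lower_bound and current == "ASYNC":
        return "SYNC"    # measuring mode is taken to be not constraining
    return current       # between the bounds: leave the mode unchanged


# Hypothetical bounds of 50% and 80% of the maximum redo rate.
print(next_mode_by_ratio("SYNC", crr=90.0, mrr=100.0, lower_bound=0.5, upper_bound=0.8))
```

Because no change is made while the fraction lies between the two bounds, small fluctuations around a single threshold do not cause the transport mode to flip back and forth.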
Persistent storage 523 is storage such as disk drives which do not lose their data when powered down. In addition to primary database 543, persistent storage 523 includes system global area (SGA) 525, which contains data that is available to all of the processes that execute in server 503 and a number of on-line redo logs (ORL) 541(0 . . . n), one of which, ORL 541(i), is shown.
Components of system 501 which are of particular interest in the present context include certain processes of logging, backup and recovery processes 511, the data structure log_archive_dest 527 in SGA 525, and the current ORL 541. Beginning with the current ORL 541, current ORL 541 contains the most recent redo data generated by server 503. The redo data is written to current ORL 541 a buffer at a time. The next buffer to be written to current ORL 541 is termed in the following the current buffer. When system 501 is employing a synchronous transport to send redo data to the standby database system, the packets of redo data sent to the standby database system are copies of the blocks of redo contained in the current buffer and are sent to the standby database system immediately after the current buffer is written to current ORL 541. The next current buffer of redo is not written to current ORL 541, nor is acknowledgement of the write of the current buffer made to the generating application, until confirmation is received that the packet of redo sent to the standby database system has been written to the standby database system's redo log. The use of synchronous transport thus guarantees that the redo log in the standby contains an exact copy of the redo data written to the current ORL 541.
The first of the logging, backup, and recovery processes that is of interest is LGWR process 513, which writes buffers of redo data to the current ORL 541. The second set of processes that are of interest are LNS processes 512, which send packets of data across the network to the standby database systems. An LNS process may employ either the synchronous or the asynchronous transport mode. In the case of the synchronous transport mode, the LNS process receives packets of redo data from the LGWR process after the redo data in the packets has been written to the current ORL 541 and sends each packet in turn to the standby, waiting until it has received the confirmation from the standby before signaling the LGWR to continue. When using the asynchronous transport mode, the LNS process simply reads blocks of data from the current ORL 541 and sends them by the fastest mode to the standby database system; there is no direct interaction with the LGWR process.
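The difference between the two transport modes can be illustrated with a small in-memory model. This is a sketch only, not the Oracle implementation: the classes `Standby` and `Primary` and their methods are hypothetical, and network transfer is modeled as a direct method call that returns the standby's confirmation.

```python
class Standby:
    """Toy standby: appends each received packet to its redo log."""

    def __init__(self) -> None:
        self.redo_log: list[bytes] = []

    def receive(self, packet: bytes) -> bool:
        self.redo_log.append(packet)
        return True  # confirmation that the packet was written


class Primary:
    """Toy primary with a current ORL and one standby destination."""

    def __init__(self, standby: Standby) -> None:
        self.orl: list[bytes] = []   # current on-line redo log
        self.standby = standby
        self.shipped = 0             # index of next ORL buffer to ship

    def write_sync(self, buf: bytes) -> None:
        """Write to the ORL, then wait for the standby's confirmation."""
        self.orl.append(buf)                     # LGWR role: write the buffer
        if not self.standby.receive(buf):        # LNS role: send and wait
            raise RuntimeError("standby did not confirm the write")
        self.shipped = len(self.orl)             # only now may the write be acknowledged

    def write_async(self, buf: bytes) -> None:
        """Write to the ORL and return without contacting the standby."""
        self.orl.append(buf)

    def ship_async(self) -> None:
        """LNS role in asynchronous mode: read from the ORL and send."""
        while self.shipped < len(self.orl):
            self.standby.receive(self.orl[self.shipped])
            self.shipped += 1


standby = Standby()
primary = Primary(standby)
primary.write_sync(b"redo-1")    # standby log matches the ORL immediately
primary.write_async(b"redo-2")   # standby lags until ship_async runs
lagging = len(primary.orl) - len(standby.redo_log)
primary.ship_async()
print(lagging, standby.redo_log == primary.orl)
```

In the model, the synchronous path keeps the standby's redo log identical to the ORL at every acknowledgement, whereas the asynchronous path lets the standby lag until the shipping step catches up.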
Data Guard processes 515, finally, is a set of processes that establishes a relationship between a primary database system and one or more standby database systems and then manages the relationship. A Data Guard operation which is important in the current context is changing the transport mode used by a primary database system to transfer redo data to a standby database system without stopping and restarting either the primary database system or the standby database system. An important component process of Data Guard processes 515 is PING ARCH process 516, which periodically pings a primary database system's standby database systems to determine whether any standby is missing redo generated by the primary. The pinging period for PING ARCH when it is used in this fashion is 1 minute.
The data structure log_archive_dest 527 in SGA 525 contains an entry 529 for every database system which the Data Guard processes 515 have configured as a standby database system for the primary database system. The part of the entry which is of interest in the present context is a set of flags 539 which indicate the kind of transport mode being used to transport redo data to the standby database system represented by the entry:
ARCH 537 indicates that there is no connection between LGWR writing data to an ORL 541(i) and the reading of redo data from an archival redo log in the primary database system for transfer to the standby. The transport mode specified by ARCH is used to send a copy of a non-current ORL 541 to the standby when PING ARCH 516 detects a gap in the redo. Gaps in the redo data arise, for example, when the standby has been down for a while or when logs were deleted before they were applied.
There are four parts to modifying an Oracle 10gR2 database system to implement the scheme of
All of the above are implemented by adding a new flag 531, SYNC_DOWNGRADED, to entry 529 and modifying PING ARCH 516. PING ARCH 516 now pings the standby every 10 seconds. PING ARCH 516 can determine the current network round trip time from its own pings, and it can use the size of the buffers that LGWR 513 writes to the current ORL 541 to determine the average size of the packets written to the standby. The time it takes the standby database system to write a packet of data to the redo log is known from statistics maintained by the database system. Consequently, PING ARCH 516 can do the following every 10 seconds: collect the necessary statistics to compute MRR and CRR, average them using the sliding window, compute CRR(x)/MRR(x), and change the transport mode whenever CRR(x)/MRR(x) so indicates for the current transport mode. When CRR(x)/MRR(x) exceeds the upper bound percentage and the SYNC transport for a particular standby database must be downgraded to ASYNC, PING ARCH 516 sets the SYNC_DOWNGRADED bit in the log_archive_dest_N structure corresponding to that destination and then requests a log switch (a change to a new ORL 541), which effects the change. Any SYNC destination with the SYNC_DOWNGRADED bit set is treated internally as an ASYNC destination. When CRR(x)/MRR(x) drops below the lower bound percentage, PING ARCH 516 clears the corresponding SYNC_DOWNGRADED bit and again requests a log switch to effect the change.
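The periodic work performed by PING ARCH 516 can be sketched as follows. This is an illustrative model under stated assumptions rather than Oracle code: the classes `Destination` and `TransportMonitor`, the window length, and the example bounds are hypothetical, the CRR and MRR samples are supplied by the caller rather than computed from database statistics, and a log switch is modeled as a simple counter.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Destination:
    """Stand-in for one standby's entry; only the fields used here."""
    sync_downgraded: bool = False   # models the SYNC_DOWNGRADED bit
    log_switches: int = 0           # counts requested log switches


class TransportMonitor:
    """Averages CRR and MRR over a sliding window and toggles the flag."""

    def __init__(self, dest: Destination, lower: float, upper: float,
                 window: int = 6) -> None:
        self.dest = dest
        self.lower = lower
        self.upper = upper
        self.crr_samples: deque = deque(maxlen=window)
        self.mrr_samples: deque = deque(maxlen=window)

    def tick(self, crr_sample: float, mrr_sample: float) -> float:
        """Run one monitoring cycle (every 10 seconds in the description)."""
        self.crr_samples.append(crr_sample)
        self.mrr_samples.append(mrr_sample)
        crr = sum(self.crr_samples) / len(self.crr_samples)
        mrr = sum(self.mrr_samples) / len(self.mrr_samples)
        if mrr <= 0:
            raise ValueError("averaged MRR must be positive")
        ratio = crr / mrr
        if not self.dest.sync_downgraded and ratio > self.upper:
            self.dest.sync_downgraded = True    # treat SYNC destination as ASYNC
            self.dest.log_switches += 1         # log switch effects the change
        elif self.dest.sync_downgraded and ratio < self.lower:
            self.dest.sync_downgraded = False   # restore SYNC behavior
            self.dest.log_switches += 1
        return ratio


dest = Destination()
monitor = TransportMonitor(dest, lower=0.5, upper=0.8, window=2)
for crr_sample in (90.0, 90.0, 10.0, 10.0):   # made-up redo rates
    monitor.tick(crr_sample, mrr_sample=100.0)
print(dest.sync_downgraded, dest.log_switches)
```

A `deque` with `maxlen` discards the oldest sample automatically, which gives the sliding-window average without explicit bookkeeping.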
In the preferred embodiment, MRR(x) always represents the maximum rate at which redo can currently be produced in synchronous mode. When the primary is operating in asynchronous mode, MRR(x) is computed from the writes which the primary makes to ORL 541 in this mode. The buffers which the primary writes to ORL 541 while it is operating in asynchronous mode are much smaller than those which it writes to ORL 541 in synchronous mode. The difference in buffer size must be accounted for by means of a scaling factor when MRR(x) is computed while the primary is operating in asynchronous mode.
The foregoing Detailed Description has disclosed to those skilled in the relevant technologies the inventors' techniques for automatically changing a database system's redo transport mode to dynamically adapt to changing workload and network conditions and has further disclosed the best mode known to the inventors of practicing their techniques. It will, however, be immediately apparent to those skilled in the relevant technologies that many implementations of the techniques other than the ones disclosed herein are possible. To begin with, the preferred embodiments are implemented in database systems manufactured by Oracle Corporation and employ the transport modes available in Oracle database systems, take advantage of the instrumentation available in Oracle database systems to determine whether a change of transport mode is desirable, and use the state available in the Oracle database systems to change the transport mode where necessary. Implementations in other database systems would similarly employ the transport modes, instrumentation, and state available in those database systems. Further, the preferred embodiment employs the techniques to switch between a transport mode that can potentially constrain the primary database system and one that cannot; the techniques can, however, be used to switch between transport modes for any reason at all. For example, a measuring transport mode could be used to determine whether a switch in transport modes based purely on risk of redo loss was desirable, or if the cost of a transport mode were an issue, a measuring transport mode could be used to determine whether a switch in transport modes based on cost was desirable.
Further, there are only two transport modes in a preferred embodiment; the techniques, however, can be employed to select among any number of transport modes. The techniques used to determine whether a current redo transport mode should be changed will of course depend not only on the database system in which the techniques are implemented, but also on the basis for switching transport modes. For all of the foregoing reasons, the Detailed Description is to be regarded as being in all respects exemplary and not restrictive, and the breadth of the invention disclosed herein is to be determined not from the Detailed Description, but rather from the claims as interpreted with the full breadth permitted by the patent laws.