1. Field
The following generally relates to database systems, and more particularly to propagating base table updates to views based on data derived from those base tables.
2. Related Art
Modern database systems comprise base tables that store directly updated data, and view tables that are derived from data obtained, directly or indirectly, from base tables (derived data). For example, a web store may use a base table for tracking inventory and another base table for tracking customer orders, and another for tracking customer biographical information. A person maintaining the web store may, for example, desire to analyze the data to prove or disprove certain hypotheses, such as whether a certain promotion was or would be successful, given previous order behavior, and other information known about customers.
The base tables are updated as changes are required to be reflected in the data. In other words, the base tables generally track or attempt to track facts, such as order placement, inventory, addresses, click history, and any number of other conceivable facts that may be desirable to store for future analysis or use. Thus, when base tables are updated, view tables that depend on data in those updated base tables ultimately should be updated to reflect those updates.
However, one concern is to avoid interfering with the main transaction systems involving applications making changes to the base tables, as the responsiveness of such systems generally affects a user's experience with the applications themselves. One conventional way to avoid slowing down systems interacting directly with users is to produce derived data (e.g., the view tables) “off-line”, so that the derived data reflects the status of the base data as of a certain update point. This approach has been acceptable because derived data was used mostly for analytics and business planning, and such uses did not require more up-to-date derived data.
Also, the amount of data generated in database systems continues to increase, as does the usage of derived data for a variety of purposes. Therefore, much greater demands are placed on systems maintaining derived data from base data updates. For example, current large scale database systems may need to handle hundreds of millions of view updates in a 24 hour period. Concurrent updating of derived data is increasingly necessary to keep up with these demands. However, concurrent updating of derived data cannot be undertaken without controls to avoid data corruption.
One approach that has been used to provide some measure of concurrency is provision of database systems that have strong Atomicity, Consistency, Isolation, and Durability (ACID) controls for each read and write transaction. Database storage systems providing such strong ACID transaction capabilities incur comparatively high overhead by implementing mechanisms to achieve these goals, and so make it difficult to scale such systems to the thousands of processing devices that would be used in such large scale database systems.
Thus, it would be desirable to provide, where possible, an approach that avoids causing view data inconsistencies, but does not incur the large overhead of using full ACID transactions. Preferably, such an approach also would be able to avoid extensive recovery procedures, such as rebuilding a view, to restore consistency.
The following disclosure relates to mechanisms providing controls for concurrent propagation of base table updates to aggregate views. A first aspect includes a database system method for concurrent view updating. The method can be executed in a plurality of view managers that are each operable to obtain base data updates and concurrently and independently propagate those updates to a view record. In the method, each view manager performs the update propagation by obtaining a respective base data update to be used in updating the view record, reading a value of the view record, as stored in a storage device, and obtaining a sequence number associated with the value when the value was read. Each view manager also performs an operation to produce a proposed update to the stored value, and submits the proposed update and the read sequence number to the storage device in a test and set transaction. Each view manager also determines whether a message received from the storage device, responsive to the submitting, indicates an update sequence error. If so, then each view manager returns to the reading step, and if no update sequence error is indicated, then each view manager treats its proposed update as committed and returns to the obtaining step for obtaining another respective base data update.
In formulating the received message, the storage device can be operable to compare the submitted sequence number with an update sequence number currently associated with the view record, and to form the message indicating the update sequence error if the submitted sequence number does not match the update sequence number currently associated with the view record.
Such methods also can include that each view manager is responsive to further errors indicated in the received message, including a record not found error. Responsive to such a message, each view manager can attempt to insert the view record, for which an update was attempted, with an initialized value. Such a situation may arise, for example, when maintaining a count view record. In some cases, the record may already have been inserted, such that the inserted can result in a record duplicate error, to which each view manager can respond by attempting to redo the method from the reading step. A variety of other variations can be provided for other types of aggregate views, which can include views for tracking one or more of sum, count, average, minimum, and maximum.
Another example aspect focuses on maintenance of views for extremas, such as a minimum or a maximum of an identified set of base data. In an example, a method comprises receiving a plurality of updates for a plurality of base records; and in each view manager of a plurality of view managers, performing operations comprising receiving one of the updates and attempting to read a current value of the view.
According to this example, if the attempt fails, the method includes attempting to insert the value from the received update, and if the attempt succeeds, then the method comprises receiving a sequence number associated with the received current value, and comparing the value from the received update with the read current value. The method also comprises, if the comparison indicates that the value from the received update sets a new extreme compared with the read current value, providing the value from the received update and the sequence number received by that view manager to the storage device. The method further comprises receiving a response to the providing; and if the response contains no error message, then the method comprises treating the update as committed. If the response includes an error message, then the method includes repeating the reading step.
In such a method, the received update can be for deleting the base record corresponding to the update, and the extreme value maintained can be equal to the received current value. The method can further comprise reading the values of the other base records to determine if another base record has a new extreme value, and if so then providing the new extreme value and the sequence number to the storage device.
Computer readable media and systems can be used in implementations as summarized above, and/or as described and claimed herein.
Log segments 130a-130n also can represent a path for providing results of queries made by view managers 110 of base records in storage units 120a-120n. For example, if a view manager conducts a read with a given key, then results can be considered to be provided through the logs. System 100 represents a generalization of any number of more particular implementations, which can include a variety of hardware and software resources for implementing storage of base records, view records, communications among storage components, computing resources for executing view manager operations and so on. View managers 110 can represent a plurality of threads of computing executing on one or more computing resources comprising, for example, many servers, potentially with multiple processors, each having multiple cores, thus representing a scenario where multiple view managers can be running concurrently, such that reads and writes conducted by those view managers may overlap in time.
Example update patterns that can be of concern include where two view managers 110 read or write a view record based on different base data updates. Such situations occur, for example, when an aggregate view, such as SUM, COUNT, and AVG, MIN and MAX views (i.e., a view that derive such information from base records) is maintained incrementally. In a more particular example, a view record can be maintained by a plurality of view managers incrementally, such that when a base record update to a given set of base records (e.g., adding a record to the set, deleting a record from the set, or updating a record in the set), a view manager propagates that base record update to the view. Scaling that situation up dramatically, many view managers ideally would be able to concurrently propagate base record updates to a view, while maintaining consistency of the view.
Examples for providing such consistency while increasing concurrency are described below.
A situation where an incorrect result can occur in a non-ACID situation is described with respect to
Each of view update program 605 and 610 would read the current value from the view table for each of the count of x1 (C1) and of x2 (C2) at a time when count of x2 is 2 and count of x1 is 1. Subsequently, each program 605 and 610 computes an update for each count. Program 605 writes its updated count x1 value back at T1, program 610 writes its count of x1 value back at T1 plus Δ, program 605 writes its updated count x2 value back at T1 plus 2Δ. Program 610 writes its count of x2 value back at T1 plus 3Δ. Thus, program 610 overwrites program 605 updates for both count of x1 and count of x2. In such a situation, the program 605 updates cannot easily be recovered, or reconstructed, and the view table ultimately ends in an incorrect state.
The following discloses various examples of how to implement aggregate updates concurrently and consistently on platforms that do not provide ACID transaction capability, but do provide Test and Set (TAS) functionality. TAS functionality can proceed as follows, using the example of
A first example is a flow for implementing insertion updates, as illustrated in
Method 700 includes determining (710) whether a record not found error message was returned. Such a message can be an indicator that a concurrently operating view manager deleted the record, because it decremented the count to zero, for example. If there was a record not found error message, then method 700 includes attempting to insert the record, with a count of 1; the sequence number can be initialized to 0, 1, or any arbitrary value. Method 700 also includes determining 725 whether a record duplicate error was returned in response to record insertion 715. If so, then this error message can indicate that another concurrent view manager already inserted the record, and if so, then method 700 returns to 705 to read the count, and the sequence number again. In the figures, an explicit showing of an return error code being assigned to the variable “Error”, which is later checked for certain error codes is omitted, but it would be understood that the variable error is assigned such codes, based on the operations undertaken in the specific example, and as shown in the pseudocode.
If there was no error detected at 725, then method 700 can complete 745, treating the update, which ultimately was effected by an insertion of a record as compared with incrementing a value in an existing record, as committed.
Returning to 710, if there was no record not found error, then the view record was read successfully, and an updated count value can be produced 712, and used in a TAS update to the view record 720, providing the updated count value and the sequence number read at 705. At 730, method 700 again checks whether there is a record not found error message (concurrent view manager could have deleted it), and if so, then method 700 returns to 705, because the count value attempted to be written would not be accurate. If the record was found, method 700 then checks for a TAS error 735, which if returned indicates that a concurrent view manager made changes, evidenced by mismatching sequence numbers between a sequence number stored by the storage device and one provided with the TAS update. If so, then method 700 returns to 705 to read again the then-current count and SN. Otherwise, method 700 can complete 745, treating the update as committed.
The pseudocode of Table 1 illustrates an example of how to propagate an inserted base record to an aggregate view, such as to count a number of base records having a certain quality. For example,
Method 800 includes reading a view record to obtain a count, and a sequence number associated with the count by a storage device (i.e., the sequence number associated with the currently stored count). Method 800 includes checking whether the count read is equal to 1 (810), which is an example minimum value for the example of decrementing by 1 (other minimum values can be used for different applications.) Note that in this example, method 800 does not need to check whether the record exists, as was done in method 700, because the particular base update being handled is an indicator that there still must be at least some count remaining to be decremented, so the view record would still exist.
If count is 1, then method 800 submits a delete TAS 815 for that record, supplying the sequence number obtained during reading, and in response 830 to a TAS failure (e.g., caused by mismatching sequence numbers), method 800 returns to reading 805 the view record again. If the count was not 1, then method 800 produces 819 an updated count (e.g., decrementing by 1), and submits 820 a TAS update with the updated count, and the sequence number. Method 800 also include checking for a TAS failure in response to the update TAS (e.g., mismatching sequence numbers), and for such an error, method 800 returns to reading 805. If no failure was detected at 830 for either 815 or 820, then the update can be treated as having been committed and method 800 is done 845.
As can be discerned for both method 700 and method 800, a number of iterations may be performed to propagate a given base table update to a given view record, if many view managers are concurrently processing different base table updates to that view record. This can be because, for example, another view manager will write an updated value while another view manager, even though having read prior, submits its update later. Although this looping causes some inefficiency, it is more scalable than traditional ACID mechanisms that are difficult to scale beyond systems with a hundred or so nodes. The present methods are expected to enable scaling to thousands and tens of thousands of nodes. The pseudocode of Table 2 relates to method 800.
The above description relates that view managers can implement operations according to three basic types, to account for different types of updates that may occur to a set of base records from which an aggregate view is derived. Variations on these operations are presented below for different types or categories of aggregate views.
Table 4, below, includes pseudocode for an example where view managers can be concurrently maintaining a sum for a group of base records, such that when a new base record is added to the group, one of the view managers would add a value from that new base record to the sum. Since the methodology is similar to that of count insert view updating, this pseudocode is described more briefly (also, pseudocode for Table 4 is also used in describing how an average may be maintained for a group of base records, as described below).
Table 4 shows that sum insert propagation includes reading a current sum, and a sequence number, checking whether the record was found or not, and if not found, then inserting the record, with a sequence number. The code is responsive to an error indicating that the insertion would cause a duplicate by returning to read the sum again. If the sum was there, then it is TAS updated with a value from the base record update being propagated. If the TAS update returns an error of either record not found or failure due to sequence number mismatch, the code rereads the sum, and if not then the update was successful.
A view maintaining an average can also be provided. One way to provide an average is to store a sum and a count for the data desired to be averaged, and calculate the average by dividing the sum with the count. In maintaining such an average, when a sum is updated for a new value, a count also would be updated, while if an existing value were revised, then the count would not be updated. The updates would be accomplished using the TAS update approach described above. Of course, if it desired to avoid explicitly calculating the average when the average is needed, the average also could be stored explicitly in a view record. In a still further alternative, an average and a count could be stored, and a sum could be calculated, when needed, based on the average and count. Thus, a view maintaining an average can be considered a usage both of updating a sum and a count value, according to the methodologies described below.
Sum delete propagation is analogous to count delete propagation, and pseudocode for sum delete propagation shown in Table 5 below can be understood by reference to the count delete discussion above. Analogous to the discussion of maintaining averages with respect to sum insert above, a count also can be maintained in sum delete pseudcode.
Sum update propagation is analogous to count update propagation, and pseudocode for sum update propagation shown in Table 6 below can be understood by reference to the count update discussion above. Analogous to the discussion of maintaining averages with respect to sum insert above, a count also can be maintained in sum update pseudcode. Sum update, like count update, can be used when both changing a value from one group to another. For example, if respective sums of salaries were maintained for two groups, and a person switched from one group to another (i.e., base data would reflect that the person switched from one group of base data to another), that base record update could be propagated to view records for each sum using operations according to the example of Table 6 pseudocode.
If a record not found error was not returned (1010), then the value returned in the read (MIN) is compared 1030 with the value of the base data triggering the update (here, identified as Y). If Y is less than MIN, then method 1000 includes test and set updating 1035 the view record with Y as the new MIN, which includes providing the sequence number read at 1005 to a storage device from which the MIN was read. If a TAS failure is returned (1040), then method 1000 returns to reading 1005, which as described above, indicates that another value was added, and for which the comparison at 1030 must be performed again, before updating MIN. In the absence of a TAS failure error, a record not found error also could be returned in the message responsive to the update attempt, and this condition is checked (1045). In the presence of a record not found error, method 1000 returns to reading (1005) the view record. Without either error condition (1040 or 1045), the update can be considered completed (1050).
Method 1000 was for a particular example of tracking a minimum. However, a converse maximum tracking method may be implemented by determining whether a stored maximum was less than a value indicated for a base record update, and if so then updating the maximum with that value. Table 7 illustrates MIN insert pseudocode.
Table 10 below illustrates pseudocode for a MAX insert update (e.g., an update to a view record caused by insertion of a base record). As evident, MAX insert parallels MIN insert, with appropriate changes for value comparisons.
Table 11 below illustrates pseudocode for a MAX delete update (e.g., an update to a view record caused by deletion of a base record). As evident, MAX delete parallels MIN delete, with appropriate changes for value comparisons.
Table 12 below illustrates pseudocode for a MAX update update (e.g., an update to a view record caused by updating of a base record). As evident, MAX update parallels MIN update, with appropriate changes for value comparisons.
In the above examples and other described aspects, methods and pseudocode were presented that would be implemented in a plurality of concurrently executing view managers. Each view manager can operate essentially independently, in that it can be responsible for propagating a given base record update to one or more appropriate view records, without explicitly coordinating, or being coordinated with the other view managers. By contrast, a full ACID transaction model operates using explicit coordination among entities seeking to update a given record. This system of explicit coordination is acceptable for some systems, but it does highly scale, since the explicit coordination overhead becomes too great as a number of participants in the system gets too large. In some examples, systems and methods according to aspects described are for use in systems having many thousands of view managers that can be updating many view records, where a plurality of view managers may be assigned to maintain larger view records.
Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, objects, components, and data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps. Program modules may also comprise any tangible computer-readable medium in connection with the various hardware computer components disclosed herein, when operating to perform a particular function based on the instructions of the program contained in the medium.
Examples of how the disclosed methods and associated computer code transform a particular article into a different state or thing include that the particular article can include memories containing values tracking views (e.g., aggregate views, such as counts, averages and the like) that relate to base data, which can represent physical events (such as purchases, sales, inventory changes, objects, people's activities, such as logins, e-mails, and so on). The view(s) stored in the memories are updated as the base data changes, which transforms the memory into a different state. Also, a memory storing any given value is a legally distinct thing from a memory storing a different value; thus, the updating also makes the memory a legally distinct thing. Of course, it would be apparent from these disclosures that these merely are examples of such transformations. Further, embodiments disclosed herein can be implemented machines, including specific machines for maintaining such information, which can be called databases.
Those of skill in the art will appreciate that embodiments may be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.