The present disclosure relates generally to managing transaction logs and read caches in a database. In particular, the present disclosure relates to efficiently managing, duplicating, and migrating transaction log and read cache data in a key-value store.
A metadata store is one example of a key-value store; it stores structural or descriptive information about other data (e.g., data stored in a large scale distributed storage system). In computer data storage systems, particularly large scale distributed storage systems, the metadata stored in a metadata store may contain information about the location or description of data stored in the large scale distributed storage system. Metadata is important in data storage systems for locating and maintaining data stored in the data storage system.
Further, if a storage node or storage device of a metadata store fails, the metadata including difficult-to-recreate transactions may be permanently lost. Storing the metadata redundantly on multiple data storage nodes in a metadata store can aid in protecting against data loss due to storage device failure. This redundant storage, however, consumes extra processing and storage resources.
For a large scale distributed storage system, maintaining a transaction log of client interactions with data stored in the system aids in recreating the current state, or a prior state, of the metadata. However, maintaining the transaction logs for a highly accessed large scale distributed storage system consumes extra processing and storage resources. Further, replaying the transaction logs to recreate a current or prior state of the metadata can be slow and consume additional processing resources.
Further, maintaining the metadata can consume valuable processing and storage resources, especially when considering the scale of today's storage systems. For instance, to allow users and administrators to better understand and manage their data files, a large amount of metadata, and thus storage resources, may be necessary to provide for effective searches over the metadata. With increased scale and storage resources consumed by metadata, the time to access the metadata store for processing metadata queries and searches is unavoidably increased.
In view of the problems associated with managing large databases, such as key-value stores, in a storage system, one object of the present disclosure is to provide a highly accessible read cache for the database. This provides for quickly assessing the current status of the database. The read cache may be created based on transaction log entries. To provide a failsafe recovery mechanism, the created read cache may be duplicated to a separate node in the storage system.
Another object of the present disclosure is to migrate some information, such as transaction log data, from a local, fast access storage system to a more robust and cost-effective system, such as an additional secondary storage system, or even a distributed storage system. This migration minimizes the amount of data on fast, local storage for more efficient accessing and processing without affecting the primary functions of the database.
Still another object of the present disclosure is to generate a snapshot for a read cache. Once a snapshot of a read cache has been generated, the transaction log entries covered by the snapshot may be intentionally deleted to save storage space in the storage system. In case of a read cache failure, instead of replaying the whole transaction log, a snapshot of the read cache may be used in place of the covered transaction log entries when rebuilding the read cache. Through this approach, a potentially large amount of storage may be saved.
These and other objects of the present disclosure may be implemented in a metadata store, which is further described below with a brief description of example system components and steps to accomplish the above and other objects for efficiently accessing and processing metadata. However, the techniques introduced herein may be implemented with various storage system structures and database content.
The techniques introduced herein may include a method including: receiving a request from a client to perform a data transaction, updating a key-value pair in a metadata store based on the request, entering the data transaction in a transaction log, updating a read cache with the key-value pair, and replicating the last transaction log entry in at least one other storage node in the metadata store. Other aspects include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
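By way of illustration only, the sequence of steps in the method above may be sketched as follows. The sketch is a minimal in-memory model, not an implementation of the disclosed system: the names `Node`, `apply_entry`, and `handle_request` are hypothetical, and the choice to have each replica apply the replicated entry to its own store and read cache is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One storage node of the metadata store (hypothetical in-memory model)."""
    store: dict = field(default_factory=dict)       # key-value pairs
    tlog: list = field(default_factory=list)        # ordered transaction log
    read_cache: dict = field(default_factory=dict)  # current state for fast reads


def apply_entry(node, entry):
    """Apply one transaction to a node's store, transaction log, and read cache."""
    op, key, value = entry
    if op == "delete":
        node.store.pop(key, None)
        node.read_cache.pop(key, None)
    else:  # "put" covers both storing and updating a key-value pair
        node.store[key] = value
        node.read_cache[key] = value
    node.tlog.append(entry)


def handle_request(master, replicas, op, key, value=None):
    """Process a client request on the master, then replicate the last log entry."""
    apply_entry(master, (op, key, value))
    last_entry = master.tlog[-1]
    for replica in replicas:
        # Assumption: a replica applies the entry it receives, so its read
        # cache tracks the master's.
        apply_entry(replica, last_entry)
    return last_entry
```

In this sketch a call such as `handle_request(master, [replica], "put", "obj-1", ["node-a"])` updates the key-value pair, logs the transaction, refreshes the read cache, and copies the newest log entry to the replica, in that order.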
The techniques introduced herein may further include one or more of the following features. The method where the data transaction includes storing, retrieving, updating, or deleting a data object. The method further includes copying a portion of the transaction log to a transaction log fragment object in a large scale distributed storage system. The method where replicating the last transaction log entry includes copying a portion of the last transaction log entry to at least one other storage node such that if one storage node fails there remains enough of the read cache on the remaining storage nodes to fully restore the read cache. The method further includes updating a read cache on a local storage device. The method further includes replicating the read cache on at least one additional local storage device. The method where the at least one additional local storage device includes solid-state drives. The method where replicating the read cache includes copying a portion of the read cache to the local storage devices such that if one local storage device fails there remains on the other local storage devices enough of the read cache to fully restore the read cache. The method where the last transaction log entry is stored on a local storage device. The method where the local storage device includes a hard disk drive.
The techniques introduced herein include a system having: a communication bus; a network interface module communicatively coupled to the communication bus; a storage interface module coupled to a storage device, the storage interface module communicatively coupled to the communication bus; a processor; and a memory module communicatively coupled to the communication bus, the memory module including instructions that when executed by the processor cause the system to receive a request from a client to perform a data transaction. The instructions further cause the processor to update a key-value pair in a metadata store based on the request. The instructions may also cause the processor to enter the data transaction in a transaction log. The instructions further cause the processor to update a read cache with the key-value pair. The instructions also cause the processor to replicate the last transaction log entry in at least one other storage node in the metadata store.
The techniques introduced in the present disclosure are illustrated by way of example, and not by way of limitation in the figures of the accompanying drawings in which like reference numerals are used to refer to similar elements.
For purposes of illustration, the techniques described herein are presented within the context of metadata stores. In particular, the techniques described herein make reference to metadata stores for a large scale distributed storage system. However, references to, and illustrations of, such environments and embodiments are strictly used as examples and are not intended to limit the mechanisms described to the specific examples provided. Indeed, the techniques described are equally applicable to any database using a transaction-like replication mechanism, or any system with state transactions and an associated read cache.
According to the techniques disclosed herein, a read cache comprises the current state of the metadata store. Maintaining a highly accessible read cache can make accessing the current state of the metadata much faster and more efficient than replaying the transaction log. A loss of the read cache due to storage device or node failure, or some other cause of data corruption or loss, could require the replaying of the transaction log from the beginning, or from some other known state in order to recreate the current state of the metadata. Duplicating the metadata read cache across multiple data storage nodes can mitigate the risk of loss of the current state of the metadata at the cost of additional processing and storage resources.
In one embodiment, the metadata store 102 is a key value store in which, for every key of a data object, data for retrieval of the data object are stored. The key may be the name, object ID, or other identifier of the data object, and the data may be a list of the storage nodes on which redundantly encoded sub blocks of the data object are stored and available. It should be apparent that other database structures (e.g., a relational database) may be used to implement the metadata store.
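As a simple illustration of the key value layout described above, the following sketch maps an object identifier to the list of storage nodes holding its sub blocks. The function names and the example identifiers are hypothetical and are used only for the example.

```python
def put_object_metadata(store, object_id, storage_nodes):
    """Record which storage nodes hold the sub blocks of a data object."""
    store[object_id] = list(storage_nodes)


def locate_object(store, object_id):
    """Return the storage nodes for an object, or an empty list if unknown."""
    return store.get(object_id, [])


# Hypothetical usage: the key is an object ID, the value a list of node names.
metadata_store = {}
put_object_metadata(metadata_store, "object-42", ["node-3", "node-7", "node-9"])
```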
The metadata store 102 may be replicated on a group of storage nodes. In one embodiment, as depicted in the example of
As provided by the techniques introduced herein, the costs of data loss may be mitigated by only duplicating the TLOG 112 or read cache 110 across some of the available nodes in the metadata store 102. For example, the TLOG 112 and read cache 110 may be duplicated across a majority of the data storage nodes in the metadata store 102. In one embodiment, the speed of access to the read cache 110 and the most relevant portion of the TLOG 112 can be addressed by storing the read cache 110 and the tail (e.g., the most recent portion) of the TLOG 112 on fast storage. For example, the nodes of the metadata store 102 may include fast, but perhaps expensive and less-durable, local solid-state drives (“SSDs”) for storing the read cache 110 and portions of the TLOG 112. Solid-state drives can provide faster data access relative to spinning platter hard disk drives. However, some SSDs operate such that each storage location has a relatively limited number of write-cycles before that location on the SSD wears out.
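The majority-based duplication mentioned above may be sketched as follows, purely as an example. The sketch assumes that each node's TLOG is modeled as a Python list and that an unreachable node is represented by `None`; the names `majority` and `replicate_to_majority` are hypothetical.

```python
def majority(node_count):
    """Smallest number of nodes that forms a majority of node_count."""
    return node_count // 2 + 1


def replicate_to_majority(entry, node_logs):
    """Append entry to reachable node logs until a majority holds it.

    node_logs is a list with one element per node: a list acting as that
    node's TLOG, or None if the node is unreachable (an assumption of this
    sketch). Returns True if a majority of all nodes now holds the entry.
    """
    needed = majority(len(node_logs))
    acked = 0
    for log in node_logs:
        if acked >= needed:
            break  # only some of the available nodes need a copy
        if log is None:
            continue  # skip unreachable nodes
        log.append(entry)
        acked += 1
    return acked >= needed
```

With three nodes, the entry is written to two of them; the third copy is skipped, which reflects the trade-off between protection against loss and the cost of duplication.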
In one embodiment, when there are three nodes in the metadata store as in the example of
While node 1 106 is labeled as the master node in
The processor 212 can include an arithmetic logic unit, a microprocessor, a general-purpose controller or some other processor array to perform computations. The processor 212 is coupled to the central data bus 220 for communication with the other components of the system 200. Although only a single processor is shown in
The memory 214 can store instructions and/or data that may be executed by processor 212. The memory 214 is coupled to the central data bus 220 for communication with the other components. The instructions and/or data may include code for performing the techniques described herein. The memory 214 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory or some other memory device known in the art.
The client 103 may initiate transactions with the large scale distributed storage system 114. These transactions may alter the read cache 110a, 110b, and 110n (also referred to herein individually and collectively as 110). Client requests are logged in the TLOG 112. In some embodiments, the read cache 110 is stored on one or more nodes of fast local storage 301, which in some embodiments may be local SSDs. The most recent client transactions are stored in the tail of the TLOG 320a, 320b, and 320n (also referred to herein individually and collectively as 320). Like the read cache, the TLOG tail may be stored on a number of nodes, designated by the number “n” and duplicated to additional nodes by copying 321 the transaction log from one node to another.
Other segments of the TLOG 322, 324, and 326 (distributed across nodes a, b, n) may be stored, in some embodiments, on secondary local storage nodes 304, 306 and 308 in the secondary local storage array 302, which in some embodiments may be local HDDs. Segments of the TLOG may be copied 323 to parallel storage nodes.
Still other segments of the TLOG 342-348 may be stored as one or more data objects 340 in a large scale distributed storage system 114.
When the most recent TLOG entries in TLOG.tail 320 meet a certain threshold, they may be migrated to the secondary local storage nodes 304, 306, 308. The triggering threshold may be a time-based trigger, or it may be satisfied when the TLOG.tail grows beyond a predetermined storage size (10 MB, for example). By moving the TLOG.tail to a secondary node 304, 306, 308 and designating the TLOG.tail as a new TLOG element 322a, an ordered sequence of TLOG files (TLOG.i+2 322a, TLOG.i+1 324a, TLOG.i 326a, etc.) is accumulated on the secondary local storage.
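One possible way to express the size-based and time-based triggers is sketched below. The 10 MB limit is taken from the example above; the one-hour age limit, the class name `TlogTail`, and the function `roll_tail` are assumptions introduced only for this illustration.

```python
import time

TAIL_MAX_BYTES = 10 * 1024 * 1024  # 10 MB, per the example above
TAIL_MAX_AGE_SECONDS = 3600        # hypothetical time-based trigger


class TlogTail:
    """The most recent TLOG entries, held on fast local storage."""

    def __init__(self, now=None):
        self.entries = []
        self.size_bytes = 0
        self.opened_at = time.time() if now is None else now

    def append(self, entry: bytes):
        self.entries.append(entry)
        self.size_bytes += len(entry)

    def should_roll(self, now=None):
        """True when either the size or the age threshold is exceeded."""
        now = time.time() if now is None else now
        too_big = self.size_bytes > TAIL_MAX_BYTES
        too_old = now - self.opened_at > TAIL_MAX_AGE_SECONDS
        return too_big or too_old


def roll_tail(tail, secondary_segments, now=None):
    """Move the tail to secondary storage as the newest segment.

    secondary_segments is an ordered list of segments (oldest first).
    Returns a fresh, empty tail for subsequent entries.
    """
    secondary_segments.append(tail.entries)
    return TlogTail(now)
```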
In some embodiments, in response to a threshold being satisfied, the read cache may be copied as a read cache snapshot, for instance Read Cache Snapshot.i 310, to secondary local storage nodes 304, 306, and/or 308 and all other TLOG entries may be removed. Examples of such a threshold may include a time limit, a number of TLOG entries or a size threshold on the TLOG. After the read cache snapshot is created, subsequent TLOG entries may then be added to a new TLOG. Subsequently, when the read cache needs to be restored, the restoration takes less time because it begins with the read cache snapshot and applies only the subsequent modifications from the new, smaller TLOG. In this embodiment, the storage capacity requirement of the secondary local storage nodes 304, 306, and 308 that hold the TLOG can be reduced.
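The snapshot and restoration steps may be illustrated by the following sketch. It assumes, for the example only, that the read cache is a dictionary and that each TLOG entry is a tuple of the form `(op, key, value)` with `op` being `"put"` or `"delete"`; the function names are hypothetical.

```python
import copy


def take_snapshot(read_cache, tlog):
    """Copy the read cache and discard the TLOG entries the copy covers."""
    snapshot = copy.deepcopy(read_cache)
    tlog.clear()  # subsequent entries accumulate in the now-empty TLOG
    return snapshot


def restore_read_cache(snapshot, tlog):
    """Rebuild the read cache from a snapshot plus the entries logged since."""
    cache = copy.deepcopy(snapshot)
    for op, key, value in tlog:
        if op == "delete":
            cache.pop(key, None)
        else:  # "put" stores or updates the key
            cache[key] = value
    return cache
```

Because only the entries logged after the snapshot are replayed, the amount of work at restoration time depends on the size of the new TLOG rather than on the full transaction history.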
As the limited nodes of the secondary storage of the metadata store approach capacity (or when a counter or time-based threshold is reached), the oldest segments of the TLOG (326 and Read Cache Snapshot.i 310, for example) can be migrated 330 to a more robust and more cost-effective storage system such as the large scale distributed storage system 114, as illustrated. Through the migration 330, the plurality of replicas of the TLOG entries 320, 322, 324, 326, and Read Cache Snapshot.i 310 in the metadata store 102 is replaced by a single entry in the large scale distributed storage system 114, which capitalizes on the robustness and efficiency of the large scale distributed storage system with its lower storage overhead and higher redundancy level. However, alternative embodiments for a remote storage system, such as Network-Attached Storage or RAID arrays, are also possible. In the event of a failure of an element of the metadata store, archived TLOG entries remain accessible by means of the reference to the Read Cache Snapshot.1 310a at the end of the sequence of TLOG files in the metadata store 102 or the Read Cache Snapshot.i-1 312 in the large scale distributed storage system 114.
In one embodiment, TLOG files are migrated 330 from the metadata store to the large scale distributed storage system 114 after all nodes of the metadata store are in sync for these TLOG files. For example, TLOG.i+2 322a is first synchronized to node 1 306 and all remaining nodes before the migration to the large scale distributed storage system 114 can take place.
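The synchronization condition above may be sketched, by way of example only, as a check that a TLOG file is present with identical content on every node before it is migrated. The sketch models each node as a dictionary from TLOG file name to its entries and the large scale distributed storage system as a plain dictionary; both are assumptions, as are the function names.

```python
def in_sync_segments(nodes):
    """Return names of TLOG files held identically by every node."""
    if not nodes:
        return []
    replicas = list(nodes.values())
    first, others = replicas[0], replicas[1:]
    return [
        name
        for name, data in first.items()
        if all(name in r and r[name] == data for r in others)
    ]


def migrate_synced(nodes, distributed_store):
    """Migrate fully synchronized TLOG files and drop the local replicas."""
    migrated = in_sync_segments(nodes)
    for name in migrated:
        # A single archived copy replaces the per-node replicas.
        distributed_store[name] = next(iter(nodes.values()))[name]
        for replica in nodes.values():
            del replica[name]
    return migrated
```

A TLOG file that has not yet reached every node is left in place and becomes eligible on a later pass, once synchronization has completed.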
In the preceding description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. It will be apparent, however, that the disclosure can be practiced without these specific details. In other instances, structures and devices have been shown in block diagram form in order to avoid obscuring the disclosure. For example, the present disclosure has been described in some implementations above with reference to user interfaces and particular hardware. However, the present disclosure applies to any type of computing device that can receive data and commands, and any devices providing services. Reference in the specification to “one implementation” or “an implementation” means that a particular feature, structure, or characteristic described in connection with the implementation is included in at least one implementation of the disclosure. The appearances of the phrase “in one implementation” or “in some implementations” in various places in the specification are not necessarily all referring to the same implementation.
Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers or the like.
It should be borne in mind, however, that these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other information storage, transmission or display devices.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the disclosure as described herein.
Finally, the foregoing description of the implementations of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the present disclosure to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the present disclosure be limited not by this detailed description, but rather by the claims of this application. As will be understood by those familiar with the art, the present disclosure may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Likewise, the particular naming and division of the modules, routines, features, attributes, methodologies and other aspects are not mandatory or significant, and the mechanisms that implement the present disclosure or its features may have different names, divisions and/or formats. Furthermore, as will be apparent to one of ordinary skill in the relevant art, the modules, routines, features, attributes, methodologies and other aspects of the present disclosure can be implemented as software, hardware, firmware or any combination of the three. Also, wherever a component, an example of which is a module, of the present disclosure is implemented as software, the component can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future in the art of computer programming. Additionally, the present disclosure is in no way limited to implementation in any specific programming language, or for any specific operating system or environment. Accordingly, the present disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.