In the field of computing, data used by processors or other computing components may be held in memory devices and systems for time periods of different durations. Memory devices and systems that hold data in a transient manner are often referred to as “caches” or “memories”, whereas memory devices and systems that hold data in a persistent manner are often referred to as “storage”. Conventional memory systems involve a hierarchical arrangement of short-term and long-term memory devices.
In the present document the expressions “memory” and “storage”, and expressions derived from the verbs “to store” and “to hold”, may be used interchangeably and, absent additional qualification, do not connote any particular degree of persistence or transience in the retention of the data (e.g. they do not signify the use of volatile rather than non-volatile technology, or vice versa).
The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the following description to refer to the same or similar parts. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only. While several examples are described in this document, modifications, adaptations, and other implementations are possible. Accordingly, the following detailed description does not limit the disclosed examples. Instead, the proper scope of the disclosed examples may be defined by the appended claims.
Memory systems are being developed which comprise a plurality of logically connected memory units whose individual memory address spaces are aggregated and exposed (e.g. to processors/computing modules)—through entry points acting similarly to gateways in computer networks—as if the whole network of memory units was but a single memory component having a uniform memory space. Such memory systems are called “memory fabrics” in the present document.
These memory fabrics treat memory as if it were a routable resource (treating memory addresses somewhat in the way that IP networks handle IP addresses). The memory fabric handles memory traffic, that is, the items routed over the memory fabric. Such items may comprise memory access requests, other messages/information that facilitate access, allocation, configuration and the like of the memory fabric, and data being read from/written to memory.
It is a challenge to implement data replication in a memory system that is a memory fabric of routable memory units.
Data replication is a technique that is used in the field of database management. It involves creating multiple instances of the same data object, and keeping the various instances of the data object synchronized even when updates take place. As a result, even if one instance of the data object is corrupted, unavailable or distant, a computing component may nevertheless easily be able to access the data object by accessing one of the other instances of the same data object. In general, the expression “replica” is used for all the instances of the same data object (so, for example, if there are m instances of the data object, each of the m instances is called a “replica”). The same usage is employed in the present document.
Different approaches have been proposed for the management of data replication. One approach used with databases is so-called “eager” replication, which synchronizes all the instances of a data object as part of the same database transaction. However, conventional eager replication protocols have significant drawbacks regarding performance and scalability, owing to the high communication overhead among the replicas and the high probability of deadlocks. Some newer eager replication protocols try to reduce these drawbacks by using group communication.
Another approach used with databases is so-called “lazy” replication management. Lazy data replication decouples replica maintenance from the “original” database transaction. In other words, a first database transaction updates one of the instances, and then other transactions, which are required in order to keep all the replicas up-to-date and consistent, run as separate and independent database transactions after the “original” transaction has committed. A drawback of lazy replication techniques is that at a given moment certain replicas of a data object may have been updated but other replicas of the same data object may not yet have been updated (i.e. the replicas of the data object are not synchronized with each other).
So, both the eager and the lazy data replication techniques that have been proposed for databases have some advantages, but also various drawbacks.
Some techniques have been proposed to implement data replication in certain types of memory devices and systems. In general, the data replication techniques proposed for memory devices and systems are not those known from work with databases, because the database techniques tend to operate at a coarse level of granularity, that is, the replicated data objects are large, complex structures (often full database tables).
For example, data replication techniques of both types (lazy and eager) have been proposed for use in centralized memory management systems.
As another example, in the case of a memory device built using phase change memory technology, in which individual bits in the memory can fail, it has been proposed that a page of memory that has flawed bits could nevertheless still be used for data storage if a backup of the same page of data is stored in a second page of memory that has pristine bits at the locations where the first page of memory has flawed bits.
As such, a technical challenge exists to implement data replication in memory fabrics, that is, in memory systems that comprise logically-connected, routable memory components whose individual address spaces are aggregated and exposed as a single memory component, to processing components and so forth that seek to read/write data in the memory system.
According to the illustrated example, the memory system 1 comprises memory components 3 that are logically connected to one another in a network 2. The present disclosure is not limited having regard to the number or nature of the memory components 3.
Similarly, the present disclosure is not limited having regard to the nature of the connections between the memory components 3; the connections between the memory components may be realized using different technologies including, but not limited to, optical interconnects, wireless connections, wired connections, and so forth. The connections in the network 2 may be of the same type or a mixture of different types. In a particular example implementation at least some of the memory components 3 are interconnected by high-throughput, low-latency optical interconnects.
Each memory component 3 has capacity to hold data and so defines an individual memory-address space 5. The data-holding capacity of a given memory component 3 may be realized using one, two, or any desired number of constituent memory units (not shown).
The memory system 1 may comprise one or several additional components 8 that do not contribute to the memory space provided by the memory system 1, but which cooperate with the memory components 3 to facilitate set-up and/or operation of the memory system 1. One additional component 8 is illustrated by way of example.
According to the illustrated example, the individual memory-address spaces 5 of the memory components 3 are aggregated and exposed to external processing components Ext as a single memory space.
The nature of the processing components Ext is not particularly limited; as a non-limiting example they may comprise processor cores, computing devices, and so forth. Although the processing components Ext are represented here as entities external to the memory system 1, the present disclosure is not limited to such an arrangement.
The memory system 1 may be configured to allow memory components to join and leave the network 2. Protocols may be provided to perform the appropriate configuration of the memory system when memory components join or leave the network 2. For example, there may be a protocol that operates when a memory component joins/leaves the network and adjusts the configuration of the memory address space that is exposed to external components Ext. Such a protocol may be implemented in various ways including, but not limited to: a peer-to-peer approach (implemented by peers provided in the memory components, as described in a co-pending patent application filed by the applicant), or an approach that uses centralized management to configure the memory space.
According to a particular implementation example, the memory system 1 comprises memory components 3 that provide persistent storage of data, for example, using non-volatile memory units, battery-backed volatile memory units, and so forth. In a specific implementation example such a memory system uses high-throughput, low-latency optical interconnects. In this example implementation the memory system may serve as a non-hierarchical memory resource providing both short-term and long-term holding of data (i.e. acting somewhat like a massive DRAM that replaces all other devices needed to store data persistently (like non-volatile memory) or to cache data (like DRAM)). Such an implementation flattens the memory hierarchy and removes the conventional differentiation between “memory” and “storage”, enabling processing nodes that use the memory system to employ simplified operating systems, giving faster access to data and reducing energy consumption.
Various protocols are implemented in the memory system 1 of the illustrated example, for example protocols to manage access to, and allocation and configuration of, the exposed memory space.
In the illustrated example, the memory system 1 also implements a protocol to manage replication of data objects written in the memory system.
In example memory systems according to the present disclosure a data “object” may be defined at various levels of granularity and may be written in the memory system 1 at a configurable number of memory positions: for example, a data object may correspond to a file held in a distributed manner over a set of memory addresses in the memory system 1, it may correspond to data written at a single memory position, and so on. In a specific implementation of the memory system the memory space is byte-addressable.
Memory systems according to the present disclosure perform data replication by implementing lazy data replication protocols in which replicas of a same data object are updated in individual transactions (not as part of the same transaction). In the memory system 1 according to the present example, hardware, firmware or software 10 is provided to implement the lazy data replication protocol.
The above-mentioned hardware, firmware or software 10 may include one or both of a processor and a machine-readable storage medium storing programming instructions or code executable by the processor to perform operations of the lazy data replication protocol. The above-mentioned hardware, firmware or software may include an application specific integrated circuit (ASIC) constructed to perform operations of the lazy data replication protocol.
Because the memory system 1 includes elements that cooperate to implement a lazy data replication protocol, an update of a first replica of a data object is decoupled from transactions which make the same update to other replicas of the same data object.
In the present document references to “updates” of a replica cover both the first time that a particular data object dj is written to the memory system and occasions when a change is to be made in a data object that is already held in the memory system. Further, in the present document references to “updates” of a replica cover the case where there is to be a rewrite of the data defining the replica held at a particular memory address, and the case where each updated value/state of a replica is stored in the memory system at a separate memory address (such that all the successive values/states of the corresponding data object are stored in the memory system).
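By way of illustration only, the following minimal Python sketch contrasts these two cases; the names and the dictionary-based model of a memory space are assumptions made for the example and are not part of the present disclosure:

    # Toy model: a byte-addressable memory space as a Python dict.
    memory = {}

    def update_in_place(address, value):
        # Case 1: the data defining the replica at 'address' is rewritten.
        memory[address] = value

    next_free_address = 0x1000
    versions = {}  # data-object id -> addresses of its successive states

    def update_versioned(obj_id, value):
        # Case 2: each updated value/state is written at a separate address,
        # so all successive states of the data object remain available.
        global next_free_address
        memory[next_free_address] = value
        versions.setdefault(obj_id, []).append(next_free_address)
        next_free_address += 1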
In the context of memory system 1 which comprises a network of memory components 3, the implementation of a lazy data replication protocol means that the updating of a replica of a data object may be completed (the update transaction may be executed) without needing to wait for news of the update transaction to propagate to all the other memory components that hold another replica of the same data object. This time-saving may have a significant impact on operation in the context where the memory system 1 comprises a physically-extended memory network in which there is a relatively significant distance between different memory components of the network 2 (e.g. the network of memory components extends over plural racks of a data centre, over plural data centres, and so forth). Thus, memory systems according to the present example have good scalability properties.
However, another challenge arises when a lazy data replication protocol is implemented in a memory system comprising a network of connected memory components, because the different replicas of a same data object held in the memory system can become unsynchronized. That is, at a given moment there may be replicas which have undergone the latest update and other replicas which have not yet undergone the latest update or, indeed, have missed a series of updates, or have been subjected to a series of updates applied in the wrong order, resulting in differences between these replicas and other replicas of the same data object.
The receiving module 13 shown in the illustrated example receives requests to update data objects written in the memory system 11.
The memory system 11 is configured to implement a data-replication protocol in which for data objects written to the memory system, a primary replica PR of the respective data object is written in a first memory component 3a and each of a selected number of other replicas OR of the data object is written in a respective second memory component 3b.
An example method implementing a lazy data replication protocol in the memory system 11 described above will now be described.
In the data-replication implementation method illustrated here, a request to update a data object whose primary replica PR is written in a first memory component 3a is received and directed to the first memory component 3a (S101).
The first memory component 3a updates the primary replica PR of the relevant data object (S102) in response to the update request, and notifies the update-synchronization manager 18 that the respective update has been implemented on the primary replica (S103). The first memory component 3a may use various techniques for generating the update notification to be sent to the update-synchronization manager 18. For example, the first memory component 3a may relay to the update-synchronization manager 18 each update request it receives, performing the relaying when the relevant update transaction has been executed on the primary replica.
The update-synchronization manager 18 orders the updates relating to the primary replica PR in the order that these updates were implemented on the primary replica PR (S104). This ordering may be achieved in various ways; as a non-limiting example, it may be achieved by putting the update notifications received from the first memory component 3a into a first-in first-out register (FIFO). The update-synchronization manager 18 transmits, to the second memory component(s) 3b of the memory system, update instructions corresponding to the respective updates that have been implemented on the primary replica PR (S105). The transmitted update instructions are ordered according to the order of implementation of the respective updates on the primary replica PR. The update-synchronization manager 18 may use various techniques for ordering the transmitted update instructions. As a non-limiting example, in the case where the update-synchronization manager 18 has loaded received update notifications into a FIFO, it may transmit the update notifications, as update instructions, by reading them out of the FIFO. Other techniques may of course be used; for example, the update-synchronization manager 18 may order updates by assigning sequence numbers or time stamps to the updates indicated in the update notifications it receives and then transmit update instructions labelled using the sequence numbers/time stamps.
In response to receiving update instructions transmitted by the update-synchronization manager 18, the second memory components update other replicas OR of the data object in the same order that the updates were implemented on the primary replica (S106). This updating of the other replicas OR may consist of a first-time write of the other replicas, or an overwrite.
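A minimal sketch of steps S101 to S106 follows, in Python. The class and method names are illustrative assumptions; in particular, propagation is shown here as an explicit method call, whereas an actual memory system would transmit the instructions over the network 2:

    from collections import deque

    class UpdateSynchronizationManager:
        # Sketch of the update-synchronization manager 18.
        def __init__(self, second_components):
            self.fifo = deque()  # S104: a FIFO preserves the order in which
                                 # updates were applied to the primary replica
            self.second_components = second_components

        def notify(self, update):
            self.fifo.append(update)  # S103: notification from first component

        def propagate(self):
            # S105: transmit ordered update instructions to second components
            while self.fifo:
                instruction = self.fifo.popleft()
                for component in self.second_components:
                    component.apply_instruction(instruction)

    class FirstMemoryComponent:
        # Holds the primary replica PR of a data object.
        def __init__(self, manager):
            self.primary_replica = None
            self.manager = manager

        def handle_update_request(self, update):
            self.primary_replica = update  # S102: update the primary replica
            self.manager.notify(update)    # S103: relay the executed update

    class SecondMemoryComponent:
        # Holds another replica OR of the data object.
        def __init__(self):
            self.replica = None

        def apply_instruction(self, instruction):
            self.replica = instruction     # S106: apply in primary order

In this toy model a call to handle_update_request() completes as soon as the primary replica has been updated; the replicas in the second memory components are only synchronized by the later, decoupled call to propagate(), which is the essence of the lazy protocol.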
In the example lazy data replication method according to the present disclosure, the update transaction performed on the primary replica PR can thus complete without waiting for the corresponding updates to be applied to the other replicas OR written in the second memory components 3b.
In the drawings, elements that are the same as elements of earlier examples are labelled using the same reference numbers.
As illustrated in the present example, the update-synchronization manager 18 may comprise a list generator 40 to generate a list of the updates notified by the first memory components 3a.
As illustrated in the present example, each second memory component 3b′ may comprise a queueing unit 34 to queue the update instructions received from the update-synchronization manager 18.
Typically, the queueing unit 34 of a given second memory component 3b′ generates a local queue that may hold update instructions relating to replicas of different data objects, where those replicas are all written in the memory space of this second memory component 3b′. However, the queueing unit 34 may be configured to generate individual queues for the update instructions relating to replicas of different data objects.
The second memory components may implement transactions to apply updates from the relevant queue to replicas OR written in their respective memory spaces. The present disclosure is not limited having regard to the timing of implementation of these transactions. However, in a specific implementation of the second memory components, update instructions from the queue may be applied to replicas in the memory space 5 of the respective second memory component during idle times when no data-read or data-refresh transaction is being performed by this memory component (see below). By exploiting these idle times, the synchronization of replicas OR with their relevant primary replica PR does not interfere with (i.e. slow down) data-read or data-refresh operations in the memory system.
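The following Python fragment sketches this idle-time behaviour under the same illustrative assumptions as the earlier sketch; the busy flag standing for an in-progress read or refresh transaction, and the modelling of update instructions as (object id, value) pairs, are assumptions made for the example:

    from collections import deque

    class QueueingSecondComponent:
        # Sketch of a second memory component 3b' with a queueing unit 34.
        def __init__(self):
            self.local_queue = deque()  # update instructions, in primary order
            self.memory_space = {}
            self.busy = False  # True while a read or refresh transaction runs

        def apply_instruction(self, instruction):
            self.local_queue.append(instruction)

        def on_idle(self):
            # Propagation runs only while no data-read or data-refresh
            # transaction is in progress, so replica synchronization does
            # not slow those operations down.
            while not self.busy and self.local_queue:
                obj_id, value = self.local_queue.popleft()
                self.memory_space[obj_id] = value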
The list generator 40 may generate a global list (or global log) formed based on update notifications transmitted by some or all of the first memory components 3a in the memory system 11. Within the global list the details of the updates relating to a primary replica PR of a particular data object are ordered according to the order of implementation of updates on that primary replica PR. However, the present disclosure is not limited to the foregoing case. Thus, for example, the list generator may generate individual lists relating to the updating of primary replicas of different data objects.
The example first memory component 3a of the present disclosure is configured to notify the update-synchronization manager 18 of each update implemented on a primary replica PR written in its memory space 5.
The memory system 61 comprises a receiving module 62 to receive data-read requests, for example from an external device EXT. The received data-read requests may include a parameter specifying a freshness requirement for the data object to be read. For example, the data-read request may identify the desired “freshness” of the data object in terms of a time stamp, a sequence number, and so on. The memory system 61 may handle the data-read request in a read transaction that reads a replica of the relevant data object. In certain implementations, the memory system 61 does not implement read transactions on primary replicas written in first memory components 3a. This approach reduces the workload on first memory components 3a, and thus reduces the risk that they might be overwhelmed by read requests (which could have the detrimental effect of delaying or preventing the updating of primary replicas written in their memory spaces 5).
In certain implementations, the memory system 61 avoids performance of read transactions at first memory components 3a by assigning memory components to classes designating them either as “read components” or “update components”. First memory components 3a may only be “update components”, and execution of data-read requests received by the memory system does not comprise reading data that is written in update components.
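A short sketch of this routing rule, with hypothetical component records (the dictionary layout is an assumption made for the example), might look as follows:

    # Components carry a class label; reads are never routed to "update"
    # components, so first memory components are never burdened with reads.
    def route_read_request(components, obj_id):
        candidates = [c for c in components
                      if c["class"] == "read" and obj_id in c["replicas"]]
        if not candidates:
            raise LookupError("no read component holds a replica of %r" % obj_id)
        return candidates[0]  # e.g. pick the first; a real system might
                              # choose by load or proximity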
The memory system 61 of this example may check whether a replica read in response to a data-read request satisfies the freshness requirement specified in the request and, if not, may perform a data-refresh transaction to bring the replica up to the required degree of freshness.
The present disclosure is not limited having regard to the location of the units that check freshness of read data and perform data refresh transactions. However, time savings may be obtained, and the architecture is simplified, in memory systems in which the second memory components 3b include elements to check the freshness of the data read from their memory spaces, and to refresh that data as desired.
Second memory component 103b of this example comprises a data-refresh unit 164 to refresh replicas OR written in its memory space 5 in response to data-read requests.
According to this example, the data-refresh unit 164 comprises a queue-checking unit 165 to check whether the queue generated by the queuing unit 34 of this second memory component 103b comprises as-yet-unimplemented update instructions that would increase the freshness of a replica OR read from the memory space 5 of this second memory component 103b in response to a read request. The queue-checking unit 165 selects freshness-increasing update instructions from the queue. The selected update instructions may be all the update instructions relating to the replica of interest that are in the queue, or may be a sub-set of these, for example, only the update instructions relating to updates which bring the replica up to (and not beyond) the desired degree of freshness specified in the data-read request.
The data-refresh unit 164 further includes a selected-instruction-application unit 167 to apply to the replica OR the update instructions that have been selected from the queue by the queue-checking unit 165.
The second memory component 103b may further include an updated-freshness evaluation unit 168 to compare the freshness of a replica that has been read in response to a read request and updated through application of update instructions by the selected-instruction-application unit 167 against the freshness requirement specified in the read request.
The data-refresh unit 164 may further comprise an interrogation unit 169 to request the update-synchronization manager 18 to provide update instructions to increase the freshness of the replica read in response to the read request (in the event that the comparison made by the updated-freshness evaluation unit 168 shows that the replica OR is still insufficiently fresh even after application of the relevant updates that had been pending in the local queue).
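Gathering units 165, 167, 168 and 169 together, the refresh logic may be sketched as follows in Python; the Update and Replica types, the use of sequence numbers as the freshness measure, and the manager's fetch_updates() interface are all assumptions made for this illustration:

    from dataclasses import dataclass

    @dataclass
    class Update:
        obj_id: str
        seq: int        # sequence number, used here as the freshness measure
        value: bytes

    class Replica:
        def __init__(self, obj_id):
            self.obj_id, self.freshness, self.value = obj_id, 0, b""

        def apply(self, update):
            self.value, self.freshness = update.value, update.seq

    def refresh(replica, required_freshness, local_queue, manager):
        # Queue-checking unit 165: select pending instructions that raise the
        # replica's freshness up to (and not beyond) the requested level.
        selected = [u for u in local_queue
                    if u.obj_id == replica.obj_id
                    and replica.freshness < u.seq <= required_freshness]
        # Selected-instruction-application unit 167: apply them in order.
        for u in sorted(selected, key=lambda u: u.seq):
            replica.apply(u)
        # Updated-freshness evaluation unit 168: is the replica fresh enough?
        if replica.freshness < required_freshness:
            # Interrogation unit 169: ask the update-synchronization manager
            # for the missing instructions (fetch_updates is hypothetical).
            for u in manager.fetch_updates(replica.obj_id,
                                           after=replica.freshness,
                                           up_to=required_freshness):
                replica.apply(u)
        return replica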
The present disclosure is not limited having regard to the structure of the memory components 3 and specifically covers the case of memory components that comprise plural memory modules.
In one example, a memory component comprises plural memory modules, one of which is a switch or router memory module that controls the routing of memory traffic to and from the other memory modules of the component. In such an example, the other memory modules together provide the data-holding capacity that defines the individual memory-address space 5 of the memory component.
An example data replication method that may be applied in a memory fabric according to examples of the present disclosure will now be described. This example data replication method applies certain of the features described above and also combines various other features.
According to the present detailed example, all memory components in the memory fabric are classified into two different types: (1) read-only, or (2) update. Read-only transactions are run only at read-only components, while update transactions are run only at update components. In the description that follows it is assumed that switch or router memory modules in the memory components control performance of the various described functions.
Update transactions consist of at least one write operation into a location of the memory fabric (the number of affected memory locations depends on the manner in which the data making up the data object is distributed over the memory fabric). Update transactions are performed on primary replicas of data objects. The changes made by update transactions at update components are serialised and logged in a global log. These changes are continuously broadcast to all read-only components in the memory fabric and are queued in the local propagation queues of the read-only components; these local queues have the same structure as the global log.
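As a sketch of this arrangement (Python, with illustrative names that are assumptions of this example), the global log serialises changes and mirrors them into each read-only component's local propagation queue:

    class GlobalLog:
        # Changes made at update components are serialised here and
        # continuously broadcast to every read-only component.
        def __init__(self, read_only_components):
            self.entries = []  # one serial order for all logged changes
            self.read_only_components = read_only_components

        def log_and_broadcast(self, change):
            self.entries.append(change)
            for component in self.read_only_components:
                # each local propagation queue mirrors the global log's order
                component.local_queue.append(change)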
According to the present example there are four types of transactions: update, read-only, propagation, and refresh transactions. An update transaction T may update an object “a” if T is initiated at the primary replica of “a”. T, however, may read any object replica in the fabric (at update and read-only components).
Upon updating a memory location with a unicast addressing request on the memory fabric, a memory router/switch controlling that memory entry triggers the updating of the other replicas of the relevant data object by relaying a similar request to update one or more other memory locations (depending on a “desired number of replicas” parameter that is included in the update request).
The present example is not limited having regard to how the memory locations for the other replicas are selected. As one example, if the memory fabric is partitioned into zones having different availability (e.g. latency) and these zones have the same size and architecture, then the other replicas may be held at memory locations whose relative position within their respective zones is the same as the relative position, within its own zone, of the memory location holding the primary replica. In that case the relayed update request updates the equivalent memory location in each of the other availability zones.
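Under the stated assumption of equally sized zones, and additionally assuming for this sketch that the zones are contiguous slices of one flat address space, the locations of the other replicas can be derived from the primary location as follows:

    def replica_locations(primary_address, zone_size, num_zones, num_replicas):
        # The primary's offset within its zone is preserved in every other zone.
        offset = primary_address % zone_size
        primary_zone = primary_address // zone_size
        locations = []
        for zone in range(num_zones):
            if zone == primary_zone:
                continue  # the primary replica already occupies this zone
            locations.append(zone * zone_size + offset)
            if len(locations) == num_replicas:
                break     # honour the "desired number of replicas" parameter
        return locations

    # Example: 4 zones of 1024 locations, primary at address 1300
    # (zone 1, offset 276) -> two further replicas at 276 and 2324.
    assert replica_locations(1300, 1024, 4, 2) == [276, 2324]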
Read-only transactions in turn may be initiated at any read-only memory switch/router. Their read operations may run at different read-only routers. This is an important generalisation: it allows for arbitrary physical data organisations at the read-only routers and arbitrary routing of read-only operations.
Propagation transactions are performed during the idle time of a memory router in order to propagate the changes present in the local propagation queues to the secondary replicas of a data object. Therefore, propagation transactions are continuously scheduled as long as there is no running read or refresh transaction.
By virtue of this protocol, the propagation transactions for the same object are initiated from the same primary memory router. (In certain implementations each router may batch updates of different memory locations under its control.) As a result, at secondary routers all synchronizing updates of replicas of the same object are ordered by the order of the primary transactions that performed the corresponding updates on the primary replica at the primary router of the object.
Finally, in this example there are refresh transactions that bring the secondary copies at read-only memory routers to the freshness level specified by a read-only transaction. A refresh transaction aggregates one or several propagation transactions into a single bulk transaction. A refresh transaction is processed when a read-only transaction requests a version that is younger than the version actually stored at the read-only memory router. Upon a refresh transaction, the memory router first checks the local propagation queue to see whether all write operations up to the required timestamp are already there. If yes, it fetches these write operations from the local propagation queue and applies them to the appropriate memory locations of the read-only memory router, in a single bulk transaction. Otherwise, it may retrieve whatever is available in the local propagation queue and communicate with the global log for the remaining part.
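The check-then-fetch behaviour of a refresh transaction may be sketched as follows (Python; the router fields, the treatment of timestamps as a contiguous series, and the global log interface are assumptions of this illustration):

    def refresh_transaction(router, required_ts, global_log):
        # Select the queued write operations up to the required timestamp.
        pending = [w for w in router.local_queue if w.ts <= required_ts]
        newest_local = max((w.ts for w in pending), default=router.applied_ts)
        if newest_local < required_ts:
            # Part of the history has not yet reached the local propagation
            # queue: fetch the remaining writes from the global log.
            pending += [w for w in global_log.entries
                        if newest_local < w.ts <= required_ts]
        # Apply everything as a single bulk transaction, in timestamp order.
        for w in sorted(pending, key=lambda w: w.ts):
            router.memory[w.address] = w.value
        router.local_queue = [w for w in router.local_queue if w.ts > required_ts]
        router.applied_ts = required_ts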
To ensure correct executions at read-only memory routers, each read-only transaction determines a version of the objects it reads at its start. In an implementation where the memory fabric is built on non-volatile memory components that provide persistent storage, old versions are stored in the fabric and the router just keeps track of the series of sequence numbers and where in the fabric each version has been stored. On the other hand, for space-saving purposes, the router may overwrite the data stored in the current location and not make use of sequence numbers.
Although the present document describes various implementations of example methods, systems and components implementing data replication protocols, it will be understood that the present disclosure is not limited by reference to the details of the specific implementations and that, in fact, variations and adaptations may be made within the scope of the appended claims.
For example, features of the various example methods, systems and components may be combined with one another in substantially any combinations and sub-combinations.
Furthermore, in the illustrated examples, the division of functions between the various components, modules and units is an example only; functions may be combined or distributed differently between the illustrated entities.
In the present document, use of the bare expressions “component”, “module”, and “unit” to designate different entities should not be taken to signify any particular hierarchy among the designated entities.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2015/068792 | 8/14/2015 | WO | 00