The present embodiments may be better understood, and their numerous objects, features, and advantages made apparent to those skilled in the art, by referencing the accompanying drawings.
The use of the same reference symbols in different drawings indicates similar or identical items.
The description that follows includes exemplary systems, methods, techniques, instruction sequences, and computer program products that embody techniques of the present invention. However, it is understood that the described invention may be practiced without these specific details. For instance, although depicted examples refer to protocol stack instances that correspond to the OSI reference model, embodiments are not limited to any particular implementation of protocols and may be realized in accordance with any of a variety of stack models, such as the four-layer TCP/IP model. In addition, the description often depicts an example with an offload engine that performs many of the operations for handling a data request and maintaining data coherence. It should be understood that the number of tasks performed by an offload engine may vary from a minimalist approach, perhaps entrusting the offload engine only with operations that do not involve message generation, to a more substantial offload of tasks. In other instances, well-known instruction instances, protocols, structures, and techniques have not been shown in detail in order not to obscure the description.
The following description refers to a data unit, a shared-cache cluster, a primary processing unit, an instantiated code, and a data coherence offload engine. A data unit refers to the transmission and/or request granularity of data, such as a block or a file. A primary processing unit refers to a processing unit (or possibly multiple processing units), whether having a single core or multiple cores, that acts as the primary processing unit for a node. For example, a node may have a set of one or more processing units responsible for core tasks and another processing unit responsible for encryption and decryption. The encryption/decryption processing unit would not be considered the primary processing unit in such an example. An instantiated code is realized as one or more pieces of code, executed by one or more processing units, that perform one or more tasks. Examples of an instantiated code include a process, a thread, an agent, an application (e.g., a database) binary, an operating system binary, etc. A data coherence offload engine is a mechanism, such as programmable logic, substantially implemented with hardware distinct from the primary processing unit of a node. Although a portion of the task(s) performed by the offload engine may be realized with execution of instruction instances, a substantial portion of the task(s) is performed by one or more hardware components. Example offload engines may comprise one or more of an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device, etc. Regardless of the implementation specifics, a simple offload engine (hardware with less capability and fewer structures than a processor) allows for transfer of tasks without the high cost of a processor. A shared-cache cluster is a cluster of nodes whose memory collectively acts as a cache for back-end storage. A cluster may be realized as a network of systems or a network of processing units.
For example, nodes may be individual systems whose dynamic random access memory collectively acts as a cache for back-end disk-based storage. In another example, nodes are individual chips with shared cache for a shared random access memory based storage. Regardless of the particular realization, a shared-cache cluster provides faster access to data in storage.
In the node A 101, the instantiated code 103 initiates a data unit request. The data unit request is communicated to the data coherence offload engine 107. Any of a number of techniques can be employed to facilitate communications between executing code on a primary processing unit and a data coherence offload engine. For example, communication may be facilitated between an instantiated code and the offload engine with a queue. In such an example, instantiated code pushes requests to the tail of the queue while the offload engine pops requests from the head of the queue for processing. It should be appreciated by those of ordinary skill in the art that reference to pushing and popping to a queue is merely illustrative and not meant to be limiting upon embodiments. A variety of structures, whether implemented in one or both of hardware and software, may be implemented to facilitate communication. The data coherence offload engine 107 determines the node in the cluster that hosts the directory. In some implementations of a shared-cache cluster, the requester node may host the directory or a portion of the directory. If node A 101 hosts a portion of the directory, then the data coherence offload engine 107 consults the local directory portion to determine if it indicates a location of the requested data unit. If the location of the data unit cannot be determined at the requester node A 101, then the data coherence offload engine 107 generates a data request and transmits the data request to node B 111 via the interconnect adaptor 105. The offload engine 118 at the node B 111 consults the directory 119 to determine a location of the requested data unit. In this illustration, the directory 119 indicates that the location of the requested data unit is node C 115. The offload engine 118 forwards the data unit request to the node C 115. The data unit request is received at the interconnect adaptor 119, and then forwarded to the offload engine 120.
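The queue-based handoff between instantiated code and an offload engine described above can be sketched as follows. This is a minimal illustration only; the class and field names are assumptions, not part of any described embodiment.

```python
from collections import deque

class RequestQueue:
    """Illustrative FIFO between instantiated code and an offload engine."""
    def __init__(self):
        self._q = deque()

    def push(self, request):
        # Instantiated code pushes a data unit request onto the tail.
        self._q.append(request)

    def pop(self):
        # The offload engine pops the oldest request from the head.
        return self._q.popleft() if self._q else None

q = RequestQueue()
q.push({"op": "read", "data_unit_id": 42})
q.push({"op": "write", "data_unit_id": 7})
assert q.pop()["data_unit_id"] == 42  # FIFO: oldest request served first
```

As the text notes, the same handoff could equally be realized with hardware mailboxes, shared registers, or other structures; the queue merely illustrates the ordering contract.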
The offload engine 120 accesses the memory 121 to retrieve the data unit. If the operations become too complex for the offload engine, then the offload engine defers to the instantiated code on the corresponding primary processing unit. For example, if the node B 111 receives a batch of requests or certain requests related to database recovery operations, then the offload engine forwards the requests to the instantiated code 113 on the primary processing unit 128.
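The deferral decision above can be sketched as a simple predicate. The threshold, request fields, and return values here are hypothetical; actual embodiments may draw the line between offload engine and instantiated code differently.

```python
def handle_request(request, offload_limit=1):
    """Decide whether the offload engine services a request itself or
    defers it to instantiated code on the primary processing unit.
    The limit and the 'batch'/'recovery' fields are illustrative."""
    batch = request.get("batch", [request])
    if len(batch) > offload_limit or request.get("recovery", False):
        return "defer_to_instantiated_code"
    return "handle_in_offload_engine"

# A simple read stays in the offload engine; recovery work is deferred.
assert handle_request({"op": "read"}) == "handle_in_offload_engine"
assert handle_request({"op": "read", "recovery": True}) == "defer_to_instantiated_code"
```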
The request travels up the protocol stack instance 303 to the network layer, where an offload engine receives the request and locates a current owner node for the requested data unit. After determining a current owner node, the offload engine forwards the request to the current owner node. After the request travels down the protocol stack instance 303, it is transmitted to a node that corresponds to the protocol stack instance 305.
The forwarded request travels up the protocol stack instance 305 and is received by an offload engine at the network layer of the protocol stack instance 305.
Example Configuration
When a message with a requested data unit arrives at a receiver, such as receiver 401a, the receiver parses the message into a header and the requested data unit. The header is written into one of the virtual channel queues 403a. The data unit is written to the data queue 411a. The header traverses the one of the virtual channel queues 403a until eventually being selected by the multiplexer 405a and written into the header register(s) 407a. Eventually, the header is received by the data coherence offload engine 450 via the multiplexer 409. The data coherence offload engine 450 processes received headers (e.g., ACKs) and informs the instantiated code that requested data has arrived and is available in the memory. For example, the data coherence offload engine sets a flag at a location polled by the instantiated code, the offload engine causes generation of an interrupt, etc. The data unit travels through the data queue 411a until arriving at the RDMA unit 413a. The data unit is then written directly to memory by the RDMA unit 413a via the multiplexer 415.
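The receiver's split of an arriving message into a header path (virtual channel queues) and a data path (data queue) can be sketched as follows. The dictionary-based message layout is an assumption for illustration; real receiver logic would operate on wire formats.

```python
def receive_message(message, vc_queues, data_queue):
    """Illustrative receiver logic: steer the header into the virtual
    channel queue its 'vc' field selects, and the payload into the data
    queue, from which it would later be written to memory by RDMA."""
    header, data_unit = message["header"], message["data_unit"]
    vc_queues[header["vc"]].append(header)
    data_queue.append(data_unit)

vcs = {0: [], 1: []}
dq = []
receive_message({"header": {"vc": 1, "type": "Shared Data"},
                 "data_unit": b"block-bytes"}, vcs, dq)
assert vcs[1][0]["type"] == "Shared Data"  # header took the VC path
assert dq == [b"block-bytes"]              # payload took the data path
```

Keeping the two paths separate, as the text describes, lets headers reach the offload engine while bulk data moves directly to memory without passing through it.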
Shared-Cache Cluster Coherence Transaction Types
The various illustrations have assumed presence of a requested data unit in the cluster without considering the state of the data unit or the transactions performed to maintain coherence. The following are example transactions employed in a shared-cache cluster architecture to maintain coherence.
Of course, the directory node and the current owner node or one of the sharer nodes could be one and the same. Also, the requester node and one of the sharer nodes could be one and the same.
Unless message dropping is allowed (not recommended if large performance swings are unacceptable), the different transaction types should be delivered to different input queues (different virtual channels) in receiving nodes in order to prevent deadlock.
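A static assignment of transaction classes to virtual channels, as suggested above, might look like the following. The class names and channel numbers are assumptions; the point is only that no two classes that can block one another share a queue.

```python
# Illustrative mapping of transaction classes to virtual channels.
# Keeping requests, forwards, responses, and acks on separate channels
# ensures a backed-up request queue cannot stall the responses and acks
# needed to drain it, which is the deadlock the text warns about.
VIRTUAL_CHANNEL = {
    "request": 0,
    "forward": 1,
    "response": 2,
    "ack": 3,
}

def channel_for(message_class):
    return VIRTUAL_CHANNEL[message_class]

# Acks never queue behind the requests that are waiting on them.
assert channel_for("ack") != channel_for("request")
```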
Shared-Cache Cluster Message Types
The following illustrates example message types for implementing the above example transactions. As stated previously, the division of labor (i.e., functionality) between the offload engine and code executed by another hardware unit may vary across embodiments.
Data Unit Request Message
A data coherence offload engine retrieves a desired data unit ID requested by the application and, if performing a read, determines if the data unit is in local memory. If it is, no transaction is needed (the offload engine informs the requesting database application instance that the data unit is available to be read from local memory). If the data unit is requested to perform a write, or the data unit is not available in local memory, the data unit's directory is consulted. The data coherence offload engine uses the data unit ID to determine a directory ID, which identifies the directory node for that data unit.
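The data unit ID to directory node mapping can be sketched as below. A simple modulo hash is assumed purely for illustration; any deterministic, cluster-wide mapping would serve, and the text does not prescribe one.

```python
def directory_node(data_unit_id, num_nodes):
    """Map a data unit ID to the node hosting its directory entry.
    Assumption: a modulo hash over the node count; the actual mapping
    could be a table, a range partition, etc."""
    return hash(data_unit_id) % num_nodes

# Every node computes the same directory node for a given data unit,
# so all requests for that unit funnel through one serialization point.
assert directory_node(1234, 4) == directory_node(1234, 4)
assert 0 <= directory_node(1234, 4) < 4
```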
Directory Data Unit Request Message from Requester Node to Directory Node
A data unit request message is forwarded to the directory node as a directory data unit request message, in order to get the state of the requested data unit. For this illustration, the directory node (like all the nodes) is assumed to have receiver logic as shown in
Forward Message from Directory Node
The “forward” messages are messages forwarded by the directory node to the owner node or, in the case of a shared data unit, to a provider node. There are, as briefly described already, several types of “forward” messages.
In our examples, a data unit's owner/provider is assumed to have receiver logic as shown in
Response Message
A node issues one of the following types of response messages after receiving a request.
If a requester node receives a “Shared Data” message, then the offload engine writes the data unit (i.e., the data payload) into the appropriate memory area. In this example, the data unit shouldn't be marked as private in the requester node. The directory node will indicate whether the data unit is private.
If a requester node receives a “Private Data” message, the offload engine writes the data unit into the assigned memory area. Once the data unit is in memory, the application database instance can perform either a read or write of the data unit.
If PRIVATE_ACK is received by the requester node, then the previously shared data unit is now private and can be modified (the data unit ID was mapped to the proper data unit memory address and data coherence offload engine has the data unit's address in memory). The data unit shouldn't be marked as private in the requester node, because the directory node is responsible for updating state of the data unit. However, marking the block as private locally will allow further modification without invoking the directory node. No ACK needs to be sent to the directory node by the requester node in this case.
For the above example scenarios, the data coherence offload engine handles a received message (e.g., handles a “Read Private” and issues an INV_ACK) or interrupts the requesting database application instance, in order to inform it that the data unit is already available. To avoid starvation, it is probably the requester node that should send the “inform” message to the directory node.
Inform (ACK) Messages
The inform messages include the following:
The INV_ACK and PI_ACK messages are received by the directory node in response to INV and PI messages, if such messages were issued by the directory node. The directory node also receives the Shared_Data_ACK or Private_Data_ACK messages, depending on the type of the request. The offload engine handles the “inform” messages and examines the “inform” message's header to invoke the proper code, based on the message type. This invoked code determines if all “forward” messages issued for a given data unit request are acknowledged, indicating a coherent state for the data unit in all the nodes. When the last ACK is received, the data coherence offload engine in the directory node changes the data unit's state according to the request and the previous data unit state and unlocks that data unit, allowing subsequent requests for that data unit to proceed.
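The lock-until-last-ACK behavior described above can be sketched with a per-data-unit directory entry that counts outstanding acknowledgments. The class, state names, and method signatures are illustrative assumptions.

```python
class DirectoryEntry:
    """Illustrative per-data-unit directory state with ACK counting."""
    def __init__(self):
        self.state = "invalid"
        self.locked = False
        self.pending_acks = 0
        self._next_state = None

    def lock(self, expected_acks, next_state):
        # Lock the data unit while a transaction is in flight; record
        # how many INV_ACK/PI_ACK/*_Data_ACK messages must arrive.
        self.locked = True
        self.pending_acks = expected_acks
        self._next_state = next_state

    def ack(self):
        # Called once per received "inform" message for this data unit.
        self.pending_acks -= 1
        if self.pending_acks == 0:
            # Last ACK: the data unit is coherent in all nodes, so commit
            # the new state and unlock, letting queued requests proceed.
            self.state = self._next_state
            self.locked = False

entry = DirectoryEntry()
entry.lock(expected_acks=2, next_state="exclusive local")
entry.ack()
assert entry.locked                 # still waiting on one ACK
entry.ack()
assert not entry.locked and entry.state == "exclusive local"
```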
Example Scenarios
The examples assume three database application instances that read and update the same data unit in a given sequence. Each of the three instances runs on a different node (Node 1, Node 2, and Node 3), with the directory node for that data unit running on a fourth node. The examples refer to block realizations of data units and to packet realizations of messages.
Block Read the First Time by Instance 3
A database instance running on Node 3 identifies a need for a block. It requests the block from the offload engine located on Node 3. The offload engine at Node 3 (the requester node) issues a Read_Shared to the directory node. An offload engine of the directory node examines the header, looks up the block ID and determines that no node has a copy of the block (the block is on a storage disk only). As a result, the data coherence offload engine assigns an entry for the block in the table (hashed or otherwise) that holds the block ID translation to the memory address where the directory (state, etc.) for the block is held. The data coherence offload engine locks the entry and waits for an ACK from the requester node that the requester node has received a copy of the block. The data coherence offload engine also issues an I/O request. When the data becomes available, data could be sent as a “Shared Data” packet to the requester node (Node 3). The data coherence offload engine of Node 3 handles the response header, and writes the block's data into the pre-assigned memory area. When the data is transferred to memory, the data coherence offload engine interrupts the Node 3 database application instance that issued the data block request, indicating that the block is available. The requesting database application instance issues (or causes the offload engine to issue) a Shared_Data_ACK packet to the directory node, before proceeding with using the block's data. The directory node's offload engine handles the Shared_Data_ACK packet by updating the block's state to shared local and unlocking the block.
Block Read by Instance 2
The database instance at Node 2 requests to read data. The offload engine at Node 2 (the requester node) issues a Read_Shared to the directory node. The offload engine of the directory node examines the header and looks up the block ID in the directory table. If the block is locked, the offload engine could go to the next queue, while keeping the current queue as the highest priority queue. Otherwise, the data coherence offload engine determines that Node 3 has a copy of the block and locks the block in the directory table while waiting for the ACK from the requester node (Node 2) that it got a copy of the block. The data coherence offload engine also issues a “Read_Shared” request to Node 3 (the block provider). When the data becomes available, that node could send it as a “Shared Data” packet to the requester node (Node 2), while keeping a copy of the block. The data coherence offload engine handles the response header, and writes the block's data into the pre-assigned memory area. When the data is transferred to memory, the data coherence offload engine interrupts the Node 2 database application instance that requested the data block, indicating that the block is available. The data coherence offload engine issues a Shared_Data_ACK packet to the directory node before the database application instance proceeds with using the block's data. The directory node's offload engine handles the Shared_Data_ACK packet by updating the block's state to shared local for Node 2 (in addition to shared local for Node 3) and unlocking the block.
Block Update by Instance 2
The database instance at Node 2 identifies interest in a block. The offload engine at Node 2 detects that it has the block of interest, but does not know if it owns the block. So, the offload engine at Node 2 issues a Read_Private to the directory node. The offload engine of the directory node examines the header and looks up the block ID in the directory table. If the block is locked, the offload engine could go to the next queue, while keeping this as the highest priority queue. Otherwise, the data coherence offload engine examines the block's directory, locks the block in the directory table, and sends a PRIVATE_ACK to the requester node. The data coherence offload engine determines that the block is “local” and shared by Node 2 and Node 3, so the copy in Node 3 should be invalidated. As a result, the data coherence offload engine issues an “INV” request to Node 3. The offload engine in Node 3 examines the packet header and, as it is INV, invalidates the block with that block ID in its table, before issuing an INV_ACK to the directory node. The offload engine of the directory node invokes the proper code to handle the response (INV_ACK) header. As no data transfer takes place, the data coherence offload engine at the requester node interrupts the Node 2 database application instance that issued the data block request, and indicates that the block is available and can be modified, after receiving the PRIVATE_ACK from the directory node. The requesting database application instance issues (or causes the offload engine to issue) a Private_Data_ACK packet to the directory node after updating the block's data. The directory node's offload engine handles the Private_Data_ACK packet by updating the block's state to exclusive local for Node 2 and unlocking the block. In another approach, the data coherence offload engine at the directory node waits to send the PRIVATE_ACK to the requester node until receiving INV_ACKs from all sharer nodes.
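The invalidation step of this scenario can be sketched as the directory-side effect of a Read_Private: every sharer other than the requester is invalidated (and must INV_ACK), and the requester's copy becomes private. The function and state names are assumptions for illustration.

```python
def read_private(requester, sharers, states):
    """Sketch of the directory-side invalidation for a Read_Private
    when the requester already shares the block: invalidate every
    other sharer and upgrade the requester to an exclusive state."""
    inv_targets = [n for n in sharers if n != requester]
    for node in inv_targets:
        states[node] = "invalid"        # sharer handles INV, sends INV_ACK
    states[requester] = "exclusive local"
    return inv_targets                  # nodes that owe an INV_ACK

states = {"Node2": "shared local", "Node3": "shared local"}
acks_from = read_private("Node2", ["Node2", "Node3"], states)
assert acks_from == ["Node3"]
assert states == {"Node2": "exclusive local", "Node3": "invalid"}
```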
Block Update by Instance 1
The offload engine at Node 1 requests a block as private, in order to modify it, by issuing a Read_Private request to the directory node. The offload engine of the directory node examines the header and looks up the block ID in the directory table. If the block is locked, the offload engine could go to the next queue, while keeping this as the highest priority queue. Otherwise, the data coherence offload engine examines the block's directory and locks the block in the directory table while waiting for the Private_Data_ACK from the requester node that it got a copy of the block. The data coherence offload engine determines that the block is “local” and owned by Node 2, so it issues a “Read_Private” with the global bit asserted, indicating that Node 2 should send the data to the requester node, while keeping a copy of the block as PI. The offload engine in Node 2 examines the packet header and, as it is “Read_Private” with the global bit asserted, marks the block with that block ID as PI in its table and reads the block's data from memory into the “Private_Data” packet it sends back to the requester (Node 1). The offload engine of Node 1 invokes the proper code to handle the response header, while the data coherence offload engine writes the block's data into the pre-assigned memory area. When the data is transferred to memory, the data coherence offload engine interrupts the Node 1 database application instance that issued the data block request, indicating that the block is available. The database application instance issues (or causes the offload engine to issue) a Private_Data_ACK packet to the directory node, before proceeding with using the block's data. However, the Private_Data_ACK may also be delayed until after using the block's data to address concerns of starvation. The directory node's offload engine handles the Private_Data_ACK packet by updating the block's state to exclusive global for Node 1 and global null for Node 2 and unlocking the block.
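The ownership transfer at the end of this scenario reduces to a pair of state updates at the directory, which can be sketched as below. The state strings follow the scenario's wording; the function itself is a hypothetical illustration, not a described embodiment.

```python
def transfer_ownership(owner, requester, states):
    """Sketch of the Read_Private-with-global-bit outcome above: the
    previous owner supplied the data and is recorded as global null
    (its local copy marked PI), while the requester becomes the new
    exclusive global owner once its Private_Data_ACK is handled."""
    states[owner] = "global null"
    states[requester] = "exclusive global"
    return states

states = {"Node1": "invalid", "Node2": "exclusive local"}
transfer_ownership("Node2", "Node1", states)
assert states["Node1"] == "exclusive global"
assert states["Node2"] == "global null"
```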
The described embodiments may include a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present embodiments. A machine-readable medium includes any mechanism for storing or transmitting information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). The machine-readable medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; electrical, optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.); or other types of medium suitable for storing electronic instructions.
While the invention has been described with reference to various realizations, it will be understood that these realizations are illustrative and that the scope of the invention is not limited to them. Many variations, modifications, additions, and improvements are possible. More generally, realizations in accordance with the present invention have been described in the context of particular realizations. These realizations are meant to be illustrative and not limiting. Accordingly, plural instances may be provided for components described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of claims that follow. Finally, structures and functionality presented as discrete components in the exemplary configurations may be implemented as a combined structure or component. These and other variations, modifications, additions, and improvements may fall within the scope of the invention as defined in the claims that follow.