The present invention relates generally to a coprocessor for improving internode communications in a cluster and more particularly to a coprocessor that handles a link-to-link protocol for improving internode communications.
Individual processing systems have greatly increased in performance. However, still greater performance is attainable by clusters of processing systems or nodes. A key factor in attaining high performance clusters is communication among the nodes.
Recognizing the disadvantages of the shared bus architecture, another technique, depicted in
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
An embodiment provides internode communication in a cluster that has very low overhead and permits direct memory-to-memory communication among the nodes, each residing in a separate physical address space. The embodiment improves both communication latency and bandwidth and provides hardware authenticated access among the physical address spaces and error checking. The embodiment increases performance of the cluster, permitting it to act much more like a single system. The embodiment also permits a higher number of nodes because the performance scales with the number of nodes.
The embodiment makes it possible to incorporate high speed non-volatile memory, such as PCM (phase-change memory) or NVRAM, local to the node and to share the memory in a distributed cluster environment at high bandwidth.
One embodiment is a computer system that includes a plurality of computing nodes and a plurality of point-to-point physical communications links. Each of the computing nodes of the plurality of computing nodes includes a coprocessor and a memory coupled to the coprocessor, where each memory resides in a separate and distinct physical domain. One or more communications links of the plurality of links are coupled between each pair of nodes in the plurality of nodes, where each coprocessor in a node is coupled to the one or more communications links to transfer data over the one or more communications links. Each coprocessor is configured to transfer data between the memory coupled to the coprocessor and the memory of another node to which the coprocessor is coupled by the one or more communications links using a certificate that grants access to a portion of the memory in the other node, or to transfer data between two other nodes in the cluster to which the coprocessor is coupled by the one or more communications links using a first certificate that grants access rights to a portion of memory in the first of the two other nodes and a second certificate that grants access rights to a portion of memory in the second of the two other nodes.
In the drawings:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Node Architecture and Coprocessor Interfaces
The memory interface 416a-b couples the coprocessor 406a-b to a hypervisor interface 452 and to a user interface 454, both of which reside in physical memory 418a-b.
The data transfer machine 466 in the coprocessor 406a-b is coupled to the physical memory 418a-b through the memory interface 416a-b and performs data transfers from or to any physical link to which the coprocessor 406a-b is coupled and the physical memory 418a-b without involving the processor 412a-b in the node.
Hypervisor Interface
The hypervisor interface 452 of a particular node depicted in
The command queues 462 contain commands, each in the form of a coprocessor control block (CCB) described below. In one embodiment, the command queues 462 are circular queues (also known as ring buffers) and contain a maximum of 16 entries. In one embodiment, a coprocessor 406a-b supports eight command queues and one priority queue for the streaming pipeline 456.
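The circular-queue behavior described above can be sketched as follows. This is a minimal illustration, not the coprocessor's actual layout: the class name `CommandQueue`, its methods, and the dictionary used as a stand-in for a CCB are all hypothetical; only the 16-entry capacity comes from the text.

```python
class CommandQueue:
    """Fixed-capacity circular queue of coprocessor control blocks (CCBs)."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity
        self.head = 0   # index of the next entry the coprocessor consumes
        self.tail = 0   # index of the next free slot for the hypervisor
        self.count = 0  # number of occupied slots

    def enqueue(self, ccb):
        """Add a CCB; return False when the ring is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = ccb
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def dequeue(self):
        """Remove and return the oldest CCB, or None when the ring is empty."""
        if self.count == 0:
            return None
        ccb = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return ccb


# Hypothetical usage: the 17th command is rejected until a slot is freed.
queue = CommandQueue()
accepted = [queue.enqueue({"op": "copy", "id": i}) for i in range(17)]
first = queue.dequeue()
```

Tracking an explicit `count` is one way to distinguish a full ring from an empty one when `head` equals `tail`; a hardware queue might instead reserve a slot or use wrap bits.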
The coprocessor status registers csr 464 provide information for configuring the coprocessor 406a-b and for synchronizing the coprocessor 406a-b. The hypervisor 460 has access to one or more of the internal coprocessor status registers csr 464, such as the doorbell register, which the hypervisor 460 uses to synchronize the issuance of commands to the coprocessor 406a-b, and status registers to configure the coprocessor 406a-b at startup and to read error logs and status.
User Interface
The user interface 454 supports user code area 470, which provides one or more data areas 472 for data movement commands and a completion data structure 474 for command results. The completion data structure 474 is a common area assigned by the hypervisor managing the coprocessor. The location of the common area is communicated to the OS and is visible to all software that has access to the area. Each data area 472 used by the coprocessor 406a-b for input and output data is accessible for data transfers by any of the coprocessors present in the physical domain, and resides permanently (i.e., is immune from being paged or swapped out of physical memory) in a physically contiguous area of memory 418a-b. Input and output data areas for remote nodes require an RKey to be accessed.
The coprocessor updates the completion data structure 474 at the end of a command to resynchronize with the hypervisor 460, which has access to the completion data structure 474. The completion data structure 474 is used for the following functions: signaling completion to the user, transmitting the command return value to the user, establishing flow control with the user, signaling user-visible errors, and logging user-visible command statistics.
Streaming Pipeline
The streaming pipeline 456 is multi-threaded and executes commands to move data from one memory location to another memory location. Either the source or destination or both may be non-local, i.e., in a remote node. The streaming pipeline also assists other remote streaming pipelines in moving data.
Command Scheduler
The command scheduler 458 schedules commands for the available threads of the streaming pipeline 456 assuming that the commands will be executed in parallel. In one embodiment, serializing flags cause two commands to be executed sequentially.
Physical Domains
Each node 402, 404 in
A remote key (RKey) is associated with a window of a memory region that has an LKey. Each LKey can include one or more RKeys and associated memory regions. The RKey grants remote access rights from one given local key in a physical domain to another local key in a remote physical domain. The remote user of a portion of memory protected by an RKey presents the RKey to access that portion of memory. The coprocessor, upon receiving the RKey, validates the key and if the validation succeeds, proceeds with the command.
In one embodiment, an RKey includes the following data items.
RKey = {Hash, Size, PA, SecretNo, Flags}, where
Hash = Encrypt(Size, Flags, Address, SecretNo), where Encrypt could be any of the popular and fast encryption schemes;
PA is the physical address associated with the RKey;
Size is the size of the region in which the RKey is valid;
Address contains the physical address in the remote physical domain;
SecretNo is one of sixteen numbers used to generate the Hash; and
Flags indicate whether the memory region is readable, writable, or a cache update.
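The RKey fields and the validation step can be illustrated with a short sketch. Several things here are assumptions rather than details from the text: HMAC-SHA256 is swapped in for the unspecified Encrypt function, the sixteen secrets are demo values, the flag bit encodings are invented, and the Address fed to the hash is taken to be the same value as PA.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical table of sixteen secrets; real values would be provisioned privately.
SECRETS = [hashlib.sha256(b"demo-secret-%d" % i).digest() for i in range(16)]

# Assumed bit encodings for the Flags field.
FLAG_READ, FLAG_WRITE, FLAG_CACHE_UPDATE = 1, 2, 4


def compute_hash(size, flags, address, secret_no):
    """Stand-in for Encrypt(Size, Flags, Address, SecretNo) using HMAC-SHA256."""
    message = b"%d|%d|%d" % (size, flags, address)
    return hmac.new(SECRETS[secret_no], message, hashlib.sha256).digest()


@dataclass(frozen=True)
class RKey:
    hash: bytes
    size: int
    pa: int
    secret_no: int
    flags: int


def make_rkey(size, pa, secret_no, flags):
    """Build an RKey whose hash binds the size, flags, and address together."""
    return RKey(compute_hash(size, flags, pa, secret_no), size, pa, secret_no, flags)


def validate(rkey, wanted_flag):
    """Recompute the hash from the presented fields and check the access mode."""
    if not 0 <= rkey.secret_no < len(SECRETS):
        return False
    expected = compute_hash(rkey.size, rkey.flags, rkey.pa, rkey.secret_no)
    return hmac.compare_digest(expected, rkey.hash) and bool(rkey.flags & wanted_flag)


# Hypothetical usage: a read-only key, and a forgery that enlarges the region.
key = make_rkey(4096, 0x1000, 3, FLAG_READ)
forged = RKey(key.hash, 8192, key.pa, key.secret_no, key.flags)
```

Because the hash covers the size, flags, and address, a remote user who alters any of those fields presents a key that no longer validates.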
In one embodiment, an RKey can cover a memory region as small as 1 KB or as large as 1 TB, as specified in the Size field.
Key Setup
A centralized configuration program in the cluster oversees the setup of all source and destination keys needed for the operation of the cluster. At initialization time, the centralized configuration program sends to the hypervisor in each node a command that specifies an LKey, an RKey that resides in the LKey, the size of the region that the RKey covers, and the access mode (readable, writable, or cache update). Upon receiving the command, the hypervisor creates the RKey, performs encryption to create the hash, and populates the RKey table in the physical memory of the node in which the hypervisor operates. The node that owns the memory containing a particular RKey table is called the “home node” for the keys in the table. A “home node” sends its RKeys to other nodes through the network so that those nodes can participate in data transfers. In operation, the RKey table in its home node is used by the coprocessor in the home node to validate a received RKey and to translate the RKey to a physical address.
In one embodiment, an RKey is created by the command rkey_create (lkey, off, len, ttl, mode), where lkey is the LKey that contains the newly created RKey, off is an offset into the RKey table for translation, len is the size of the newly created region, ttl is the time to live parameter, and mode is the read or write access mode. The time to live parameter ttl limits the life of the key for added security. After a key expires, rights granted to the region covered by the RKey are revoked and access to the same region requires a new key. Not only does the time to live parameter help maintain security, the RKey table itself also does. In particular, to secure a node from receiving outside transfers, the node can invalidate its own RKey table. The invalidated table causes all transfers to the node with such a table to fail validation. Each attempted transfer receives a negative acknowledgment indicating that the validation failed.
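A rough model of key creation, expiry, and table invalidation is sketched below. The `RKeyTable` class, its `check` and `invalidate` methods, the injectable clock, the use of seconds for the time to live, the string encoding of the access mode, and the "ACK"/"NACK" return values are all illustrative assumptions; only the rkey_create parameter list comes from the text.

```python
import time


class RKeyTable:
    """Toy RKey table for one home node; entries are indexed by table offset."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock   # injectable so that expiry can be tested
        self.entries = {}
        self.valid = True

    def rkey_create(self, lkey, off, length, ttl, mode):
        """Record a region under `lkey` that expires `ttl` seconds from now."""
        self.entries[off] = {
            "lkey": lkey,
            "len": length,
            "mode": mode,                    # e.g. "r", "w", or "rw" (assumed)
            "expires": self.clock() + ttl,
        }
        return off

    def invalidate(self):
        """Refuse all incoming transfers by invalidating the whole table."""
        self.valid = False

    def check(self, off, mode):
        """Return 'ACK' if the access is allowed, otherwise 'NACK'."""
        entry = self.entries.get(off)
        if not self.valid or entry is None:
            return "NACK"
        if self.clock() >= entry["expires"] or mode not in entry["mode"]:
            return "NACK"
        return "ACK"


# Hypothetical usage with a fake clock so expiry is deterministic.
now = [0.0]
table = RKeyTable(clock=lambda: now[0])
handle = table.rkey_create(lkey=7, off=0, length=4096, ttl=10, mode="r")
```

Invalidating the table is a single flag flip in this sketch, which mirrors the idea that a node can shut out every outside transfer at once without revoking keys one by one.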
Coprocessor Command Queue Operation
The coprocessor operates in the physical address space so that any commands sent to the coprocessor contain only physical addresses. All data structures visible to the coprocessor reside in contiguous physical locations and are expected to stay resident in memory (i.e., not be swapped or paged out).
Coprocessor Command Execution
When the coprocessor receives a command from the hypervisor, it executes the command asynchronously with the thread in the multi-threaded hypervisor that issued the command. If the hypervisor sends multiple commands, the coprocessor schedules them to be executed in round-robin fashion. The coprocessor can execute some commands in parallel.
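Round-robin scheduling can be sketched as below. The text does not say what the coprocessor rotates over, so this example assumes it takes one command at a time from each non-empty command queue in turn; the function name and the string command labels are hypothetical.

```python
from collections import deque


def round_robin(queues):
    """Yield commands by visiting each non-empty queue in turn."""
    pending = deque(deque(q) for q in queues if q)
    while pending:
        q = pending.popleft()
        yield q.popleft()          # take one command from this queue
        if q:
            pending.append(q)      # queue still has work: back of the line


# Hypothetical usage: three queues of different lengths.
order = list(round_robin([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]))
```

No queue can starve another in this scheme, since a queue that has just been served goes to the back of the rotation.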
In an alternative embodiment, the initiator sends the destination RKey, with a forwarding instruction, to the source endpoint in PDOM 1. The source endpoint still validates the source RKey and uses the validated source RKey to access the requested data, as in steps 804, 806, and 808 of
Coprocessor Commands
A coprocessor supports a variety of data movement and maintenance commands. The data movement commands include copy type, fill type, store type, compare type, and modify type commands.
The copy type commands move data from a source address to a destination address or immediate data to the destination address. If the source or destination address is not local, then an RKey specifies the destination address. When the command completes it posts a result in the completion data structure.
The fill type commands take an immediate data value and use it to fill a region of memory starting at a destination address. If the destination address is not local, then an RKey specifies the destination address.
The store type commands take an immediate data value and store it in the destination address. If the destination is not local, then an RKey specifies the destination address.
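The local case of the copy, fill, and store commands can be modeled on a byte array. This is only a sketch: the function signatures, the one-byte fill value, the 8-byte little-endian store width, and the 64-byte memory are assumptions, and the RKey-validated remote case is left out.

```python
memory = bytearray(64)  # stand-in for a node's physical memory


def copy(mem, src, dst, length):
    """Copy `length` bytes from src to dst; the slice is read before writing."""
    mem[dst:dst + length] = mem[src:src + length]


def fill(mem, dst, length, value):
    """Fill `length` bytes starting at dst with a one-byte immediate value."""
    mem[dst:dst + length] = bytes([value]) * length


def store(mem, dst, value, width=8):
    """Store an immediate integer at dst as `width` little-endian bytes."""
    mem[dst:dst + width] = value.to_bytes(width, "little")


# Hypothetical usage: store a value, copy it elsewhere, then fill a short run.
store(memory, 0, 0x1122334455667788)
copy(memory, 0, 16, 8)
fill(memory, 32, 4, 0xAB)
```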
Compare type commands take a compare value and a swap value. The command compares contents at a destination address with the compare value. If the two are equal, then the command writes the swap value into the destination contents and returns the old contents of the destination. If the destination is not local, then an RKey specifies the destination address.
Another type of compare command takes an immediate value and compare value, and compares the compare value with the destination contents. If the compare value is strictly larger, then the command updates the destination contents with the immediate value and returns the old contents of the destination. If the destination is not local, then an RKey specifies the destination address.
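The two compare commands can be sketched as follows. A dictionary stands in for memory and a lock stands in for whatever atomicity the hardware provides; both are assumptions, as are the function names and the choice to return the old contents whether or not the update happens.

```python
import threading

_lock = threading.Lock()  # models the atomicity the hardware would provide


def compare_and_swap(mem, addr, compare, swap):
    """If mem[addr] equals compare, write swap; return the old contents."""
    with _lock:
        old = mem[addr]
        if old == compare:
            mem[addr] = swap
        return old


def compare_greater_and_update(mem, addr, compare, immediate):
    """If compare is strictly larger than mem[addr], write immediate; return the old contents."""
    with _lock:
        old = mem[addr]
        if compare > old:
            mem[addr] = immediate
        return old


# Hypothetical usage on a one-word memory.
mem = {0: 5}
old_a = compare_and_swap(mem, 0, 5, 9)             # matches, so 9 is written
old_b = compare_and_swap(mem, 0, 5, 1)             # no match, contents unchanged
old_c = compare_greater_and_update(mem, 0, 10, 42) # 10 > 9, so 42 is written
old_d = compare_greater_and_update(mem, 0, 42, 7)  # 42 is not strictly larger
```

Returning the old contents lets the caller tell whether its update took effect by comparing the returned value with the compare value it supplied.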
The modify type commands take an immediate value and add the value to or OR the value with the contents of a destination address and return the old contents of the destination. If the destination is not local, then an RKey specifies the destination address.
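The modify commands correspond to the familiar fetch-and-add and fetch-and-OR operations, sketched below on a dictionary standing in for memory. The function names and the 64-bit wraparound on addition are assumptions; the text does not state an operand width.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit operand width


def fetch_add(mem, addr, immediate):
    """Add immediate to mem[addr] (wrapping at 64 bits); return the old contents."""
    old = mem[addr]
    mem[addr] = (old + immediate) & MASK64
    return old


def fetch_or(mem, addr, immediate):
    """OR immediate into mem[addr]; return the old contents."""
    old = mem[addr]
    mem[addr] = old | immediate
    return old


# Hypothetical usage: a counter at address 0 and a bit mask at address 8.
mem = {0: 41, 8: 0b0101}
old_count = fetch_add(mem, 0, 1)
old_mask = fetch_or(mem, 8, 0b0010)
```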
Hardware Overview
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example,
Computer system 900 also includes a main memory 906, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and instructions to be executed by processor 904. Main memory 906 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 904. Such instructions, when stored in non-transitory storage media accessible to processor 904, render computer system 900 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 900 further includes a read only memory (ROM) 908 or other static storage device coupled to bus 902 for storing static information and instructions for processor 904. A storage device 910, such as a magnetic disk or optical disk, is provided and coupled to bus 902 for storing information and instructions.
Computer system 900 may be coupled via bus 902 to a display 912, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 914, including alphanumeric and other keys, is coupled to bus 902 for communicating information and command selections to processor 904. Another type of user input device is cursor control 916, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 904 and for controlling cursor movement on display 912. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 900 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 900 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 900 in response to processor 904 executing one or more sequences of one or more instructions contained in main memory 906. Such instructions may be read into main memory 906 from another storage medium, such as storage device 910. Execution of the sequences of instructions contained in main memory 906 causes processor 904 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 910. Volatile media includes dynamic memory, such as main memory 906. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 904 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 900 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 902. Bus 902 carries the data to main memory 906, from which processor 904 retrieves and executes the instructions. The instructions received by main memory 906 may optionally be stored on storage device 910 either before or after execution by processor 904.
Computer system 900 also includes a communication interface 918 coupled to bus 902. Communication interface 918 provides a two-way data communication coupling to a network link 920 that is connected to a local network 922. For example, communication interface 918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 918 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 918 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 920 typically provides data communication through one or more networks to other data devices. For example, network link 920 may provide a connection through local network 922 to a host computer 924 or to data equipment operated by an Internet Service Provider (ISP) 926. ISP 926 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 928. Local network 922 and Internet 928 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 920 and through communication interface 918, which carry the digital data to and from computer system 900, are example forms of transmission media.
Computer system 900 can send messages and receive data, including program code, through the network(s), network link 920 and communication interface 918. In the Internet example, a server 930 might transmit a requested code for an application program through Internet 928, ISP 926, local network 922 and communication interface 918.
The received code may be executed by processor 904 as it is received, and/or stored in storage device 910, or other non-volatile storage for later execution.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.
This application claims priority to U.S. Application 61/709,142, filed on Oct. 2, 2012, and titled “TECHNIQUES FOR ACCELERATING DATABASE OPERATIONS”, the entire contents of which are incorporated by reference as if fully set forth herein and for all purposes. This application incorporates by reference the entire contents of U.S. application Ser. No. 13/839,525, titled “REMOTE-KEY BASED MEMORY BUFFER ACCESS CONTROL MECHANISM”, filed on even date herewith, as if fully set forth herein and for all purposes.
Number | Date | Country | |
---|---|---|---|
20140095651 A1 | Apr 2014 | US |
Number | Date | Country | |
---|---|---|---|
61709142 | Oct 2012 | US |