The present invention relates generally to network appliances that can be included in servers, and more particularly to network appliances that can include computing modules with multiple ports for interconnection with other servers or other computing modules.
Networked applications often run on dedicated servers that maintain an associated “state” for context- or session-defined applications. Servers can run multiple applications, each associated with a specific state running on the server. Common server applications include an Apache web server, a MySQL database application, PHP hypertext preprocessing, video or audio processing with Kaltura-supported software, packet filters, application caches, management and application switches, accounting, analytics, and logging.
Unfortunately, servers can be limited by computational and memory storage costs associated with switching between applications. When multiple applications are constantly required to be available, the overhead associated with storing the session state of each application can result in poor performance due to constant switching between applications. Dividing applications between multiple processor cores can help alleviate the application switching problem, but does not eliminate it, since even advanced processors often only have eight to sixteen cores, while hundreds of application or session states may be required.
Embodiments disclosed herein show appliances with computing elements for use in network server devices. The appliance can include multiple connection points for rapid and flexible processing of data by the computing elements. Such connection points can include, but are not limited to, a network connection and/or a memory bus connection. In some embodiments, computing elements can be memory bus connected devices, having one or more wired network connection points, as well as processors for data processing operations. Embodiments can further include the networking of appliances via the multiple connections, to enable various different modes of operation. Still other embodiments include larger systems that can incorporate such computing elements, including heterogeneous architecture which can include both conventional servers as well as servers deploying the appliances.
In some embodiments, appliances can be systems having a computing module attached to a memory bus to execute operations according to compute requests included in at least the address signals received over the memory bus. In particular embodiments, the address signals can be the physical addresses of system memory space. Memory bus attached computing modules can include processing sections to decode computing requests from received addresses, as well as computing elements for performing such computing requests.
In some embodiments, a computing module 102 can also include a network connection 134. Thus, computing elements in the computing module 102 can be accessed via memory bus 104 and/or network connection 134. In particular embodiments, a network connection 134 can be a wired or wireless connection.
Optionally, a system 100 can include one or more conventional memory devices 112 attached to the memory bus 104. Conventional memory device 112 can have storage locations corresponding to physical addresses received over memory bus 104.
According to embodiments, computing module 102 can be accessible via interfaces and/or protocols generated from other devices and processes, which are encoded into memory bus signals. Such signals can take the form of memory device requests, but are effectively operational requests for execution by a computing module 102.
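By way of illustration only, the following sketch shows how a host-side process might encode an operational request into memory bus signals of this kind. All of the names, bit positions, and helper functions (ximm_encode_offset, ximm_post_request, the shift values) are hypothetical placeholders used for explanation, not the actual interface.

```c
#include <stdint.h>
#include <stddef.h>

#define XIMM_OP_SHIFT  24u   /* hypothetical: operation code carried in upper offset bits */
#define XIMM_TGT_SHIFT 19u   /* hypothetical: target compute-element select field */

/* Build an address offset whose bits encode the requested operation and target. */
static inline uintptr_t ximm_encode_offset(unsigned op, unsigned target)
{
    return ((uintptr_t)op << XIMM_OP_SHIFT) | ((uintptr_t)target << XIMM_TGT_SHIFT);
}

/* Issue the request as ordinary memory writes; the computing module decodes the
 * address (not only the data) to recover the operation semantics. */
static void ximm_post_request(volatile uint8_t *ximm_base,
                              unsigned op, unsigned target,
                              const uint64_t *payload, size_t nwords)
{
    volatile uint64_t *dst =
        (volatile uint64_t *)(ximm_base + ximm_encode_offset(op, target));

    for (size_t i = 0; i < nwords; i++)
        dst[i] = payload[i];   /* appears on the bus as writes into the module's range */
}
```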
In some embodiments, a XIMM 202 can include a physical interface compatible with an existing memory bus standard. In particular embodiments, a XIMM 202 can include an interface compatible with a dual in-line memory module (DIMM) type memory bus. In very particular embodiments, a XIMM 202 can operate according to a double data rate (DDR) type memory interface (e.g., DDR3, DDR4). However, in alternate embodiments, a XIMM 202 can be compatible with any other suitable memory bus. Other memory buses can include, without limitation, memory buses with separate read and write data buses and/or non-multiplexed addresses. In the embodiment shown, among various other components, a XIMM 202 can include an arbiter circuit 208. An arbiter circuit 208 can decode physical addresses into compute operation requests, in addition to performing various other functions on the XIMM 202.
A XIMM 202 can also include one or more other non-memory interfaces 234. In particular embodiments, non-memory interfaces 234 can be network interfaces to enable one or more physical network connections to the XIMM 202.
Accordingly, a XIMM 202 can be conceptualized as having multiple ports composed of the host device-XIMM interface over memory bus 204, as well as non-memory interface(s) 234.
In the embodiment shown, control device 206 can include a memory controller 206-0 and a host device 206-1. A memory controller 206-0 can generate memory access signals on memory bus 204 according to requests issued from host device 206-1 (or some other device). As noted, in particular embodiments, a memory controller 206-0 can be a DDR type controller attached to a DIMM type memory bus.
A host device 206-1 can receive and/or generate computing requests based on an application program or the like. A host device 206-1 can include a request encoder 214. A request encoder 214 can encode computing operation requests into memory requests executable by memory controller 206-0. Thus, a request encoder 214 and memory controller 206-0 can be conceptualized as forming a host device-XIMM interface. According to embodiments, a host device-XIMM interface can be a lowest level protocol in a hierarchy of protocols to enable a host device to access a XIMM 202.
In particular embodiments, a host device-XIMM interface can encapsulate the interface and semantics of accesses used in reads and writes initiated by the host device 206-1 to initiate, control, or configure computing operations of XIMMs 202. At the interface level, XIMMs 202 can appear to a host device 206-1 as memory devices having a base physical address and some memory address range (i.e., the XIMM has some size, but it is understood that the size represents accessible operations rather than storage locations).
Optionally, a system 200 can also include a conventional memory module 212. In a particular embodiment, memory module 212 can be a DIMM.
In some embodiments, an appliance 200 can include multiple memory channels accessible by a memory controller 206-0. A XIMM 202 can reside on a particular memory channel, and accesses to XIMM 202 can go through the memory controller 206-0 for the channel that a XIMM 202 resides on. There can be multiple XIMMs on a same channel, or one or more XIMMs on different channels.
According to some embodiments, accesses to a XIMM 202 can go through the same operations as those executed for accessing storage locations of a conventional memory module 212 residing on the channel (or that could reside on the channel). However, such accesses vary substantially from conventional memory access operations. Based on address information, an arbiter 208 within a XIMM 202 can respond to a host device memory access like a conventional memory module 212. However, within a XIMM 202 such an access can identify one or more targeted resources of the XIMM 202 (input/output queues, a scatter-list for DMA, etc.), as well as which device is mastering the transaction (e.g., host device, network interface card (NIC), or other bus attached device such as a peripheral component interconnect (PCI) type device). Viewed this way, such accesses of a XIMM 202 can be conceptualized as encoding the semantics of the access into a physical address.
According to some embodiments, a host device-XIMM protocol can be in contrast to many conventional communication protocols. In conventional protocols, there can be an outer layer-2 (L2) header which expresses the semantics of an access over the physical communication medium. In contrast, according to some embodiments, a host device-XIMM interface can depart from such conventional approaches in that communication occurs over a memory bus, and in particular embodiments, can be mediated by a memory controller (e.g., 206-0). Thus, according to some embodiments, all or a portion of a physical memory address can serve as a substitute for the L2 header in the communication between the host device 206-1 and a XIMM 202. Further, an address decode performed by an arbiter 208 within the XIMM 202 can be a substitute for an L2 header decode for a particular access (where such decoding can take into account the type of access (read or write)).
As disclosed herein, according to embodiments, physical memory addresses received by a XIMM can start or modify operations of the XIMM.
According to embodiments, XIMMs can have read addresses that are different than their write addresses. In some embodiments, XIMMs can be accessed by memory controllers with a global write buffer (GWB) or another similar memory caching structure. Such a memory controller can service read requests from its GWB when the address of a read matches the address of a write in the GWB. Such optimizations may not be suitable for XIMM accesses in some embodiments, since XIMMs are not conventional memory devices. For example, a write to a XIMM can update the internal state of the XIMM, and a subsequent read would have to follow after the write has been performed at the XIMM (i.e., such accesses have to be performed at the XIMM, not at the memory controller). In some particular embodiments, a same XIMM can have different read and write address ranges. In such an arrangement, reads from a XIMM that have been written to will not return data from the GWB.
XIMMs 702-0/1 can be attached to memory bus 704, and can be accessed by read and/or write operations by memory controller 706-0. XIMMs 702-0/1 can have read addresses that are different from write addresses (ADD Read !=ADD Write).
Optionally, an appliance 700 can include a conventional memory device (DIMM) 712 attached to the same memory bus 704 as XIMMs 702-0/1. Conventional memory device 712 can have conventional read/write address mapping, where data written to an address is read back from the same address.
According to some embodiments, host devices (e.g., x86 type processors) of an appliance can utilize processor speculative reads. Therefore, if a XIMM is viewed as a write-combining or cacheable memory by such a processor, the processor may speculate with reads to the XIMMs. As understood from embodiments herein, reads to XIMMs are not data accesses, but rather encoded operations, thus speculative reads could be destructive to a XIMM state.
Accordingly, according to some embodiments, in systems having speculative reads, XIMM read address ranges can be mapped as uncached. Because uncached reads can incur latencies, in some embodiments, XIMM accesses can vary according to data output size. For encoded read operations that result in smaller data outputs from the XIMMs (e.g., 64 to 128 bytes), such data can be output in a conventional read fashion. However, for larger data sizes, where possible, such accesses can involve direct memory access (DMA) type transfers (or DMA equivalents of other memory bus types).
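As one non-limiting sketch of the size-dependent read path just described, small results could be pulled with ordinary uncached reads while larger results are moved by a DMA-style transfer. The threshold value and the helper functions (ximm_uncached_read, ximm_dma_read) are assumptions for illustration only.

```c
#include <stdint.h>
#include <stddef.h>

#define XIMM_SMALL_READ_MAX 128u     /* e.g., 64 to 128 bytes per the description above */

/* Hypothetical helpers assumed to exist elsewhere in the host driver. */
extern void ximm_uncached_read(volatile void *src, void *dst, size_t len);
extern int  ximm_dma_read(uint64_t ximm_phys_addr, void *dst, size_t len);

static int ximm_fetch_result(volatile void *mapped_addr, uint64_t phys_addr,
                             void *dst, size_t len)
{
    if (len <= XIMM_SMALL_READ_MAX) {
        ximm_uncached_read(mapped_addr, dst, len);  /* conventional read fashion */
        return 0;
    }
    return ximm_dma_read(phys_addr, dst, len);      /* DMA (or bus-equivalent) transfer */
}
```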
In systems according to some embodiments, write caching can be employed. While embodiments can include XIMM write addresses that are uncached (as in the case of read addresses) such an arrangement may be less desirable due to the performance hit incurred, particularly if accesses include burst writes of data to XIMMs. Write-back caching can also yield unsuitable results if implemented with XIMMs. Write caching can result in consecutive writes to the same cache line, resulting in write data from a previous access being overwritten. This can essentially destroy any previous write operation to the XIMM address. Write-through caching can incur extra overhead that is unnecessary, particularly when there may never be reads to addresses that are written (i.e., embodiments when XIMM read addresses are different from their write addresses).
In light of the above, according to some embodiments a XIMM write address range can be mapped as write-combining. Thus, such writes can be stored and combined in some structure (e.g., write combine buffer) and then written in order into the XIMM.
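A minimal Linux-style sketch of this mapping policy is shown below, assuming a kernel driver on an x86-type host: the read range is mapped uncached with ioremap() (which yields an uncached mapping on typical platforms) and the write range is mapped write-combining with ioremap_wc(). The base addresses, range size, and structure names are placeholders, and error handling is trimmed for brevity.

```c
#include <linux/io.h>
#include <linux/errno.h>
#include <linux/types.h>

struct ximm_map {
    void __iomem *rd;   /* uncached: reads always reach the XIMM, never a cache */
    void __iomem *wr;   /* write-combining: writes are combined, then issued in order */
};

static int ximm_map_ranges(struct ximm_map *m,
                           phys_addr_t rd_base, phys_addr_t wr_base,
                           size_t range_size)
{
    m->rd = ioremap(rd_base, range_size);      /* uncached read address range */
    m->wr = ioremap_wc(wr_base, range_size);   /* write-combining write address range */
    if (!m->rd || !m->wr)
        return -ENOMEM;
    return 0;
}
```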
The particular control device 806 shown can also include a cache controller 806-2 connected to memory bus 804. A cache controller 806-2 can have a cache policy 826, which in the embodiment shown, can treat XIMM read addresses as uncached, XIMM write addresses as write combining, and addresses for conventional memories (e.g., DIMMs) as cacheable. A cache memory 806-3 can be connected to the cache controller 806-2. While
According to embodiments, an address that accesses a XIMM can be decomposed into a base physical address and an offset (shown as ADD Ext 1, ADD Ext 2 in
As noted above, for systems with memory controllers having a GWB or similar type of caching, XIMMs can have separate read and write address ranges. Furthermore, read address ranges can be mapped as uncached, in order to ensure that no speculative reads are made to a XIMM. Writes can be mapped as write-combining in order to ensure that writes always get performed when they are issued, and with suitable performance (see
According to embodiments, address ranges for XIMMs can be chosen to be a multiple of the largest page size that can be mapped (e.g., either 2 or 4 Mbytes). Since these page table mappings may not be backed up by RAM pages, but are in fact a device mapping, a host kernel can be configured for as many large pages as it takes to map a maximum number of XIMMs. As but one very particular example, there can be 32 to 64 large pages/XIMM, given that the read and write address ranges must both have their own mappings.
As noted above, according to some embodiments data transfers between XIMMs and a data source/sink can vary according to size.
According to some embodiments, a type of write operation to a XIMM can vary according to write data size.
Possible data transfer paths to/from XIMMs 1202-0/1 can include a path 1242-0 between processor(s) 1206-1 and a XIMM 1202-0, a path 1242-1 between a bus attached (e.g., PCI) device 1206-5 and a XIMM 1202-0, and a path 1242-2 between one XIMM 1202-0 and another XIMM 1202-1. In some embodiments, such data transfers (1242-0 to -2) can occur through DMA or equivalent type transfers.
In particular embodiments, an appliance can include host-XIMM interface that is compatible with DRAM type accesses (e.g., DIMM accesses). In such embodiments, accesses to the XIMM can be via row address strobe (RAS) and then (in some cases) a column address strobe (CAS) phase of a memory access. As understood from embodiments herein, internally to the XIMM, there is no row and column selection of memory cells as would occur in a conventional memory device. Rather, the physical address provided in the RAS and (optionally CAS) phases can inform circuits within the XIMM (e.g., an arbiter 208 of
As noted herein, a XIMM can include an arbiter for handling accesses over a memory bus. In embodiments where address multiplexing is used (i.e., a row address is followed by a column address), an interface/protocol can encode certain operations along address boundaries of the most significant portion of a multiplexed address (most often the row address). Further such encoding can vary according to access type.
In particular embodiments, how an address is encoded can vary according to the access type. In an embodiment with row and column addresses, an arbiter within a XIMM can be capable of locating the data being accessed for an operation and can return data in a subsequent CAS phase of the access. In such an embodiment, in read accesses, a physical address presented in the RAS phase of the access identifies the data for the arbiter so that the arbiter has a chance to respond in time during the CAS phase. In a very particular embodiment, read addresses for XIMMs are aligned on row address boundaries (e.g., a 4K boundary assuming a 12-bit row address).
While embodiments can include address encoding limitations in read accesses to ensure rapid response, such a limitation may not be included in write accesses, since no data will be returned. For writes, an interface may have a write address (e.g., row address, or both row and column address) completely determine a target within the XIMM to which the write data are sent.
In some appliances, a control device can include a memory controller that utilizes error correction and/or detection (ECC). According to some embodiments, in such an appliance ECC can be disabled, at least for accesses to XIMMs. However, in other embodiments, XIMMs can include the ECC algorithm utilized by the memory controller, and generate the appropriate ECC bits for data transfers.
It is noted that
In some embodiments, all reads of different resources in a XIMM can fall on a separate range (e.g., 4K) of the address. An address map can divide the address offset into three (or optionally four) fields: Class bits; Selector bits; Additional address metadata; and, optionally, a Read/write bit. Such fields can have the following features:
Class bits: can be used to define the type of transaction encoded in the address
Selector bits: can be used to select a FIFO or a processor (e.g., ARM) within a particular class, or perhaps specify different control operations.
Additional address metadata: can be used to further define a particular class of transaction involving the compute elements.
Read/write: One (or more) bits can be used to determine whether the access applies to a read or a write. This can be a highest bit of the physical address offset for the XIMM.
Furthermore, according to embodiments, an address map can be large enough in range to accommodate transfers to/from any given processor/resource. In some embodiments, such a range can be at least 256 Kbytes, more particularly 512 Kbytes.
Input formats according to very particular embodiments will now be described. The description below points out an arrangement in which three address classes can be encoded in the upper bits of the physical address (optionally allowing for a R/W bit), with a static 512K address range for each processor/resource. The basic address format for a XIMM according to this particular embodiment is shown in Table 1:
In an address mapping like that of Table 1, a XIMM can have a mapping of up to 128 Mbytes in size, and each read/write address range can be 64 Mbytes in size. There can be 16 Mbytes/32=512 Kbytes available for data transfer to/from a processor/resource. There can be an additional 4 Mbytes available for large transfers to/from only one processor/resource at a time. In the format above, bits 25, 24 of the address offset can determine the address class. An address class determines the handling and format of the access. In one embodiment, there can be three address classes: Control, APP and DMA.
Control: There can be two types of Control inputs—Global Control and Local Control. Control inputs can be used for various control functions for a XIMM, including but not limited to: clock synchronization between a request encoder (e.g., XKD) and an Arbiter of a XIMM; metadata reads; and assigning physical address ranges to a compute element, as but a few examples. Control inputs may access FIFOs with control data in them, or may result in the Arbiter updating its internal state.
APP: Accesses which are of the APP class can target a processor (ARM) core (i.e., computing element) and involve data transfer into/out of a compute element.
DMA: This type of access can be performed by a DMA device. Optionally, whether it is a read or write can be specified in the R/W bit in the address for the access.
Each of the class bits can determine a different address format. An arbiter within the XIMM can interpret the address based upon the class and whether the access is a read or write. Examples of particular address formats are discussed below.
Possible address formats for the different classes are as follows:
One particular example of a Control Address Format according to an embodiment is shown in Table 2.
Class bits 00b: This is the address format for Control Class inputs. Bits 25 and 24 can be 0. Bit 23 can be used to specify whether the Control input is Global or Local. Global control inputs can be for an arbiter of a XIMM, whereas a local control input can be for control operations of a particular processor/resource within the XIMM (e.g., computing element, ARM core, etc.). Control bits 22 . . . 12 are available for a Control type and/or to specify a target resource. An initial data word of 64 bits can be followed by “payload” data words, which can provide for additional decoding or control values.
In a particular embodiment, bit 23=1 can specify Global Control. Field “XXX” can be zero for reads (i.e., the lower 12 bits), but these 12 bits can hold address metadata for writes, which may be used for Local Control inputs. Since Control inputs are not data intensive, not all of the Target/Cntrl Select bits may be used. A 4K maximum input size can be one limit for Control inputs. Thus, when the Global bit is 0 (Control inputs destined for an ARM), only the Select bits 16 . . . 12 can be set.
One particular example of an Application (APP) Address Format is shown in Table 3. In the example shown, for APP class inputs, bit 25=0, bit 24=1. This address format can have the following form (RW may not be included):
Field “XXX” may encode address metadata on writes but can all be 0's on reads.
It is understood that a largest size of transfer that can be made with a fixed format scheme like that shown can be 512K. Therefore, in a particular embodiment, bits 18 . . . 12 can be 0 so that the Target Select bits are aligned on a 512K boundary. The Target Select bits can allow for a 512K byte range for every resource of the XIMM, with an additional 4 Mbytes that can be used for a large transfer.
One particular example of a DMA Address Format is shown in Table 4. For a DMA address class bits can be 10b. This format can be used for a DMA operation to or from a XIMM. In some embodiments, control signals can indicate read/write. Other embodiments may include bit 26 to determine read/write.
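By way of illustration only, the sketch below composes a physical-address offset following the fields described above for Table 1 and Tables 2 through 4: the class in offset bits 25:24, an optional read/write bit above the class bits, target-select bits aligned on 512 Kbyte boundaries, and low-order address metadata. The exact field positions and widths are illustrative assumptions, not a definitive layout.

```c
#include <stdint.h>

enum ximm_class { XIMM_CLASS_CONTROL = 0x0, XIMM_CLASS_APP = 0x1, XIMM_CLASS_DMA = 0x2 };

#define XIMM_RW_BIT       26u   /* optional read/write bit above the class bits */
#define XIMM_CLASS_SHIFT  24u   /* bits 25:24 select Control/APP/DMA */
#define XIMM_TSEL_SHIFT   19u   /* 512 Kbyte-aligned target select (16 MB / 32 targets) */

static inline uint64_t ximm_offset(enum ximm_class cls, unsigned target,
                                   unsigned is_write, uint32_t metadata)
{
    return ((uint64_t)is_write << XIMM_RW_BIT) |
           ((uint64_t)cls      << XIMM_CLASS_SHIFT) |
           ((uint64_t)target   << XIMM_TSEL_SHIFT) |
           (metadata & 0xFFFu);            /* low 12 bits: address metadata on writes */
}

/* Example: an APP-class write targeting compute element 5. */
/* uint64_t off = ximm_offset(XIMM_CLASS_APP, 5, 1, 0); */
```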
In embodiments in which a XIMM can be accessed over a DDR channel, a XIMM can be a slave device. Therefore, when the XIMM Arbiter has an output queued up for the host or any other destination, it does not master the DDR transaction and send the data. Instead, such output data is read by the host or a DMA device. According to embodiments, a host and the XIMM/Arbiter have coordinated schedules; thus the host (or other destination) knows the rate of arrival/generation of data at a XIMM and can time its reads accordingly.
Embodiments can include other metadata that can be communicated in reads from a XIMM as part of a payload. This metadata may not be part of the address and can be generated by an Arbiter on the XIMM. The purpose of Arbiter metadata in a request encoder (e.g., XKD)-Arbiter interface can be to communicate scheduling information so that the request encoder can schedule reads in a timely enough manner to minimize the latency of XIMM processing, as well as to avoid back-pressure in the XIMMs.
Therefore, in some embodiments, a request encoder-Arbiter pair having a DDR interface can operate as follows. A request encoder can encode metadata in the address of DDR inputs sent to the Arbiter, as discussed above. Clock synchronization and adjustment protocols can maintain a clock-synchronous domain of a request encoder instance and its DDR-network of XIMMs. All XIMMs in the network can maintain a clock that is kept in sync with the local request encoder clock. A request encoder can timestamp inputs it sends to the Arbiter. When data are read from the Arbiter by the request encoder (e.g., host), the XIMM Arbiter can write metadata with the data, communicating information about what data is available to read next. Still further, a request encoder can issue control messages to an Arbiter to query its output queue(s) and to acquire other relevant state information.
According to embodiments, XIMMs in a same memory domain can operate in a same clock domain. XIMMs of a same memory domain can be those that are directly accessible by a host device or other request encoder (e.g., an instance of an XKD and those XIMMs that are directly accessible via memory bus accesses). Hereinafter, reference to an XKD is understood to be any suitable request encoder.
A common clock domain can enable the organization of scheduled accesses to keep data moving through the XIMMs. According to some embodiments, an XKD does not have to poll for output or output metadata on its own host schedule, as XIMM operations can be synchronized for deterministic operations on data. An Arbiter can communicate at time intervals when data will be ready for reading, or at an interval of data arrival rate, as the Arbiter and XKD can have synchronized clock values.
Thus, according to embodiments, each Arbiter of a XIMM can implement a clock that is kept in sync with an XKD. When a XKD discovers a XIMM through a startup operation (e.g., SMBIOS operation) or through a probe read, the XKD can seek to sync up the Arbiter clock with its own clock, so that subsequent communication is deterministic. From then on, the Arbiter will implement a simple clock synchronization protocol to maintain clock synchronization, if needed. Such synchronization may not be needed, or may be needed very infrequently according to the type of clock circuits employed on the XIMM.
According to very particular embodiments, an Arbiter clock can operate with fine granularity (e.g., nanosecond granularity) for accurate timestamping. However, for operations with a host, an Arbiter can sync up with a coarser granularity (e.g., microsecond granularity). In some embodiments, a clock drift of up to one μsec can be allowed.
Clock synchronization can be implemented in any suitable way. As but one example, periodic clock values can be transmitted from one device to another (e.g., controller to XIMM or vice versa). In addition or alternatively, circuits can be used for clock synchronization, including but not limited to PLL, DLL circuits operating on an input clock signal and/or a clock recovered from a data stream.
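By way of example only, a very simple synchronization scheme consistent with the above could keep a per-arbiter offset that maps a local nanosecond counter onto the request encoder clock, resetting the offset on each CLOCK_SYNC input and re-synchronizing when drift exceeds a microsecond-scale bound. The local clock source and function names below are assumptions for illustration.

```c
#include <stdint.h>

extern uint64_t arbiter_local_ns(void);   /* assumed nanosecond-granularity local counter */

static int64_t clock_offset_ns;           /* request-encoder time minus local time */

/* Handle a CLOCK_SYNC control input carrying the encoder's 64-bit clock value. */
static void arbiter_clock_sync(uint64_t xkd_clock_ns)
{
    clock_offset_ns = (int64_t)xkd_clock_ns - (int64_t)arbiter_local_ns();
}

/* Timestamp (in the shared clock domain) used when metadata is returned. */
static uint64_t arbiter_now_ns(void)
{
    return arbiter_local_ns() + (uint64_t)clock_offset_ns;
}

/* Host side: decide whether the arbiter clock is "too far out of sync". */
static int needs_resync(uint64_t arbiter_ts_ns, uint64_t xkd_ts_ns)
{
    int64_t drift = (int64_t)(arbiter_ts_ns - xkd_ts_ns);
    if (drift < 0)
        drift = -drift;
    return drift > 1000;                  /* allow roughly one microsecond of drift */
}
```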
Referring to
Referring to
According to some embodiments, whenever an arbiter responds to a read request from the host, where the read is not a DMA read, an arbiter can include the following metadata: (1) a timestamp of the input when it arrived in storage circuits of the arbiter (e.g., a FIFO of the arbiter); (2) information for data queued up from a XIMM, (e.g., source, destination, length). The arbiter metadata can be modified to accommodate a bulk interface. A bulk interface can handle up to some maximum number of inputs, with source and length for each input queued. Such a configuration can allow bulk reads of arbiter output and subsequent queuing in memory (e.g., RAM) of a XIMM output so that the number of XKD transactions can be reduced.
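As one illustrative, non-limiting example, the metadata returned with a non-DMA read could be organized as follows; the field widths and the bulk-entry cap are assumptions rather than a fixed format.

```c
#include <stdint.h>

struct arbiter_meta {
    uint64_t fifo_timestamp;   /* when the input arrived in the arbiter's FIFO */
    uint16_t next_source;      /* where the next queued-up XIMM output comes from */
    uint16_t next_dest;        /* its destination (host, NIC, another XIMM, ...) */
    uint32_t next_length;      /* its length in bytes */
};

#define ARB_BULK_MAX 8         /* hypothetical cap on inputs described per bulk read */

struct arbiter_bulk_meta {
    uint32_t count;                          /* number of queued inputs described */
    struct arbiter_meta entry[ARB_BULK_MAX]; /* source/length for each queued input */
};
```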
According to some embodiments, an appliance can issue various control messages from an XKD to an arbiter of a XIMM. Control messages are described below, and can be a subset of the control messages that a request encoder can send to an arbiter according to very particular embodiments. The control messages described here can assist in the synchronization between the XKD and the Arbiter.
Probe read: These can be read operations used for XIMM discovery. An Arbiter of any XIMM can return the data synchronously for the reads. The data returned can be constant and identify the device residing on the bus as a XIMM. In a particular embodiment, such a response can be 64 bytes and include XIMM model number, XIMM version, operating system (e.g., Linux version running on ARM cores), and other configuration data.
Output snapshot: This can be a read operation to XIMM to get information on any Arbiter output queues, such as the lengths of each, along with any state that is of interest for a queue. Since these reads are for the Arbiter, in a format like that of Table 1 a global bit can be set. In a very particular embodiment bit 21 can be set.
Clock sync: This operation can be used to set the clock base for the Arbiter clock. There can be a clock value in the data (e.g., 64 bit), and the rest of the input can be padded with 0's. In a format like that of Table 1 a global bit can be set, and in a very particular embodiment bit 23 can be set. It is noted that a XKD can send a ClockSync input to the Arbiter if a read from a XIMM shows the Arbiter clock to be too far out of sync (assuming the read yields timestamp or other synchronization data).
Embodiments herein have described XIMM address classes and formats used in communication with a XIMM. While some semantics are encoded in the address, for some transactions it may not be possible to encode all semantics, nor to include parity on all inputs, or to encode a timestamp, etc. This section discusses the input formats that can be used at the beginning of the data that is sent along with the address of the input. The description below shows Control and APP class inputs, which are assumed to be DDR inputs; thus there can be data encoded in the address, and the input header can be sent at the head of the data according to the formats specified.
The below examples correspond to a format like that shown in Table 1.
Data can be returned synchronously for Probe Reads and can identify the memory device as a XIMM.
Address: (XIMM base addr)+(class bits=00b)+(bit 23=1)+(Control select=bit setting for XIMM_PROBE). Table 5 shows an example of returned data.
This next input is the response to an OUTPUT_PROBE:
Address: (XIMM base addr)+(class bits=00b)+(bit 23=1)+(Control select=bit setting for APP_SCHEDULING)
This format assumes output from a single source. Alternate embodiments can be modified to accommodate bulk reads, so that one read can absorb multiple inputs, with an XKD buffering the input data. Table 6 shows an example of returned data.
The following is the CLOCK_SYNC input, sent by a XKD when it first identifies a XIMM or when it deems the XIMM as being too out of sync with the XKD.
Address: (XIMM base addr)+(class bits=00b)+(bit 23=1)+(Control select=bit setting for CLOCK_SYNC). Table 7 shows an example.
This next input can be issued after the XIMM has indicated in its metadata that no output is queued up. When a XKD encounters that, it can start polling an Arbiter for an output (in some embodiments, this can be at predetermined intervals).
Address: (XIMM base addr)+(class bits=00b)+(bit 23=1)+(Control select=bit setting for OUTPUT_PROBE). Table 8 shows an example.
The following input can be sent by an XKD to associate a Xocket ID with a compute element of a XIMM (e.g., an ARM core). From then on, the Xocket ID can be used in Target Select bits of the address for Local Control or APP inputs.
Address: (XIMM base addr)+(class bits=00b)+(bit 23=1)+(Control select=bit setting for SET_XOCKET_MAPPING). Table 9 shows an example.
The following input can be used to set a Large Transfer (Xfer) Window mapping. In the example shown, it is presumed that no acknowledgement is required. That is, once this input is sent, the next input using the Large Xfer Window should go to the latest destination.
Address: (XIMM base addr)+(class bits=00b)+(bit 23=1)+(Control select=bit setting for SET_LARGE_XFER_WNDW). Table 10 shows an example.
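By way of illustration, a request encoder could compose the Control-class addresses listed above as shown in the sketch below: the XIMM base address, class bits 00b, the Global bit (bit 23), and a control-select code in the Control type field. The numeric control-select codes are hypothetical placeholders for XIMM_PROBE, OUTPUT_PROBE, CLOCK_SYNC, SET_XOCKET_MAPPING and SET_LARGE_XFER_WNDW.

```c
#include <stdint.h>

#define XIMM_GLOBAL_BIT      (1ull << 23)           /* Global (arbiter-directed) control input */
#define XIMM_CTRL_SELECT(x)  ((uint64_t)(x) << 12)  /* control type/target select field */

enum ximm_ctrl {                 /* placeholder control-select codes */
    CTRL_XIMM_PROBE = 1,
    CTRL_OUTPUT_SNAPSHOT,
    CTRL_CLOCK_SYNC,
    CTRL_OUTPUT_PROBE,
    CTRL_SET_XOCKET_MAPPING,
    CTRL_SET_LARGE_XFER_WNDW,
};

/* Control class is 00b, so no class bits need to be set in the offset;
 * the XIMM base address is assumed to be aligned above the offset field. */
static inline uint64_t ximm_ctrl_addr(uint64_t ximm_base, enum ximm_ctrl sel)
{
    return ximm_base | XIMM_GLOBAL_BIT | XIMM_CTRL_SELECT(sel);
}
```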
Writes:
Below is an example of an input format for writes to a socket/application on a computing resource (e.g., ARM core). Note that for these types of writes, all writes to the same socket or to the same physical address can be of this message until M/8 bytes of the payload are received, and the remaining bytes to a 64B boundary are zero-filled. If a parity or a zero fill is indicated, errors can be posted in the monitoring status (see Reads). That is, writes may be interleaved if the different writes are targeting different destinations within the XIMM. The host drivers can make sure that there is only one write at a time targeting a given computing resource. Table 12 shows an example.
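As a small illustrative helper for the zero-fill rule just described, a write payload can be padded with zero bytes up to the next 64-byte boundary before it is sent; the buffer management around the helper is assumed.

```c
#include <stdint.h>
#include <string.h>

#define XIMM_WRITE_ALIGN 64u

/* Returns the padded length; buf must have room for the added padding. */
static size_t ximm_pad_write(uint8_t *buf, size_t payload_len)
{
    size_t padded = (payload_len + XIMM_WRITE_ALIGN - 1) & ~(size_t)(XIMM_WRITE_ALIGN - 1);
    memset(buf + payload_len, 0, padded - payload_len);   /* zero-fill to 64B boundary */
    return padded;
}
```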
Reads:
Below in Tables 13 and 14 is a scatter transmit example. Class=APP; Decode=SCATTER_TX.
Below in Table 15 a gather receive example is shown. Class=APP, Decode=GATHER_RX.
While embodiments can include network appliances, including those with XIMMs, other embodiments can include computing infrastructures that employ such appliances. Such infrastructures can run different distributed frameworks for “big data” processing, as but one limited example. Such a computing infrastructure can host multiple diverse, large distributed frameworks with little change compared to conventional systems.
A computing infrastructure according to particular embodiments can be conceptualized as including a cluster infrastructure and a computational infrastructure. A cluster infrastructure can manage and configure computing clusters, including but not limited to cluster resource allocation, distributed consensus/agreement, failure detection, replication, resource location, and data exchange methods.
A computational infrastructure according to particular embodiments can be directed to unstructured data, and can include two classes of applications: batch and streaming. Both classes of applications can apply the same types of transformations to the data sets. However, the applications can differ in the size of the data sets (batch applications, like Hadoop, can typically be used for very large data sets). Nevertheless, the data transformations can be similar, since the data is fundamentally unstructured, and that can determine the nature of the operations on the data.
According to embodiments, computing infrastructures can include network appliances (referred to herein as appliances), as described herein, or equivalents. Such appliances can improve the processing of data by the infrastructures. Such an appliance can be integrated into server systems. In particular embodiments, an appliance can be placed within the same rack or alternatively, a different rack than a corresponding server.
A computing infrastructure can accommodate different frameworks with little porting effort and ease of configuration, as compared to conventional systems. According to embodiments, allocation and use of resources for a framework can be transparent to a user.
According to embodiments, a computing infrastructure can include cluster management to enable the integration of appliances into a system having other components.
Cluster infrastructures according to embodiments will now be described. According to embodiments, applications hosted by a computing system can include a cluster manager. As but one particular example, Mesos can be used in the cluster infrastructure. A distributed computation application (such as Storm, Spark, or Hadoop) can be built on the cluster manager, and can utilize unique clusters (referred to herein as Xockets clusters) based on computing elements of appliances deployed in the computing system. A cluster manager can encapsulate the semantics of different frameworks to enable the configuration of different frameworks. Xockets clusters can be divided along framework lines.
A cluster manager can include extensions to accommodate Xockets clusters. According to embodiments, resources provided by Xockets clusters can be described in terms of computational elements (CEs). A CE can correspond to an element within an appliance, and can include any of: processor core(s), memory, programmable logic, or even predetermined fixed logic functions. In one very particular embodiment, a computational element can include two ARM cores, a fixed amount of shared synchronous dynamic RAM (SDRAM), and one programmable logic unit. As will be described in more detail below, in some embodiments, a majority if not all of the computing elements can be formed on XIMMs, or equivalent devices, of the appliance. In some embodiments, computational elements can extend beyond memory bus mounted resources, and can include other elements on or accessible via the appliance, such as a host processor (e.g., x86 processor) of the appliance and some amount of RAM. The latter resources reflect how appliance elements can cooperate with XIMM elements in a system according to embodiments.
The above description of XIMM resources is in contrast to conventional server approaches, which may allocate resources in terms of processors or Gbytes of RAM, typical metrics of conventional server nodes.
According to embodiments, allocation of Xockets clusters can vary according to the particular framework.
A Xockets translation layer 1804 can translate framework calls into requests relevant for a Xockets cluster 1806. A Xockets translation layer 1804 can be relevant to a particular framework and its computational infrastructure. As will be described further below, a Xockets computational infrastructure can be particular to each distributed framework being hosted, and so the particulars of a framework's resource requirements will be understood and stored with the corresponding Xockets translation layer (1804). As but one very particular example, a Spark transformation on a Dstream that is performing a countByWindow could require one computational element, whereas a groupByKeyAndWindow might require two computational elements, an x86 helper process and some amount of RAM depending upon window size. For each Xockets cluster there can be a resource list associated with the different transformations associated with a framework. Such a resource list is derived from the computational infrastructure of the hosted framework.
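By way of example only, a translation layer could keep a per-framework resource list of the following kind, mapping each transformation to the Xockets resources it needs. The two Spark entries mirror the examples above; the specific numbers and the structure layout are illustrative assumptions rather than a fixed interface.

```c
#include <stddef.h>
#include <string.h>

struct xkt_resource_req {
    const char *transform;      /* framework transformation name */
    unsigned    compute_elems;  /* XIMM computational elements required */
    unsigned    host_helpers;   /* helper processes on the host (x86) */
    unsigned    ram_mbytes;     /* appliance RAM; may depend on window size */
};

static const struct xkt_resource_req spark_resource_list[] = {
    { "countByWindow",        1, 0,   0 },
    { "groupByKeyAndWindow",  2, 1, 512 },   /* RAM amount actually depends on window size */
};

static const struct xkt_resource_req *xkt_lookup(const char *transform)
{
    for (size_t i = 0; i < sizeof(spark_resource_list) / sizeof(spark_resource_list[0]); i++)
        if (strcmp(spark_resource_list[i].transform, transform) == 0)
            return &spark_resource_list[i];
    return NULL;
}
```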
A Xockets cluster 1806 can include various computing elements CE0 to CEn, which can take the form of any of the various circuits described herein, or equivalents (i.e., processor cores, programmable logic, memory, and combinations thereof). In the particular implementation shown, a Xockets cluster 1806 can also include a host processor, which can be resident on the appliance housing the XIMMs which contain the computing elements (CE0 to CEn). Computing elements (CE0 to CEn) can be accessed by XKD 1812.
In other embodiments, a framework can run on one or more appliances and one or more regular server clusters (i.e., a hybrid cluster). Such an arrangement is shown in
Hybrid cluster 1908 can include conventional cluster elements such as processors 1910-0/1 and RAM 1910-2. In the embodiment shown, a proxy layer 1914 can run above XKD and can communicate with the cluster manager 1902 master. In one very particular example of a hybrid cluster arrangement, an appliance can reside under a top-of-the-rack (TOR) switch and can be part of a cluster that includes conventional servers from the rest of the rack, as well as even more racks, which can also contain one or more Appliances. For such hybrid clusters, additional policies can be implemented.
In a hybrid cluster, frameworks can be allocated resources from both Appliance(s) and regular servers. In some embodiments, a local Xockets driver can be responsible for the allocation of its local XIMM resources (e.g., CEs). That is, resources in an Appliance can be tracked and managed by the Xockets driver running on the unit processor (e.g., x86s) on the same Appliance.
According to embodiments, in hybrid clusters, Xockets resources can continue to be offered in units of computational elements (CEs). Note, in some embodiments, such CEs may not include the number of host (e.g., x86) processors or cores. In very particular embodiments, appliances can include memory bus mounted XIMMs, and CE resources may be allocated from the unit processor (e.g., x86) driver mastering the memory bus of the appliance (to which the XIMMs are connected).
As shown in
For hybrid clusters, resources can be allocated between Xockets nodes and regular nodes (i.e., nodes made of regular servers). According to some embodiments, a default allocation policy can be for framework resources to use as many Xockets resources as are available, and rely upon traditional resources only when there are not enough of the Xockets resources. However, for some frameworks, such a default policy can be overridden, allowing resources to be divided for best results. As but one very particular example, in a Map-Reduce computation, it is very likely the Mappers or Reducers will run on a regular server processor (x86) and the Xockets resources can be used to ameliorate the shuffle and lighten the burden of the reduce phase, so that Xockets clusters are working cooperatively with regular server nodes. In this example the framework allocation would discriminate between regular and Xockets resources.
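A minimal sketch of the default allocation policy described above is shown below: requests are satisfied from Xockets computational elements first, with regular server nodes used only for the remainder. Frameworks that need to discriminate (as in the Map-Reduce example) would override this policy; all names here are illustrative.

```c
struct hybrid_alloc {
    unsigned from_xockets;   /* CEs taken from appliance XIMMs */
    unsigned from_regular;   /* node units taken from regular servers */
};

static struct hybrid_alloc allocate_default(unsigned requested,
                                            unsigned xockets_free,
                                            unsigned regular_free)
{
    struct hybrid_alloc a = { 0, 0 };

    /* Prefer Xockets resources; fall back to regular servers for the rest. */
    a.from_xockets = (requested <= xockets_free) ? requested : xockets_free;
    requested     -= a.from_xockets;
    a.from_regular = (requested <= regular_free) ? requested : regular_free;
    return a;
}
```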
Thus, in some embodiments, a cluster manager will not share the same Xockets cluster resources across frameworks. Xockets clusters can be allocated to particular frameworks. In some embodiments, direct communication between a cluster manager master and slave computational elements can be proxied on the host processor (x86) if the cluster manager master is running locally. A Xockets driver can control the XIMM resources (CEs) and that control plane can be conceptualized as running over the cluster manager.
Referring still to
Thus, in some embodiments, a system can employ a cluster manager for Xockets clusters, not for sharing Xockets clusters across different frameworks, but for configuring and allocating Xockets nodes to particular frameworks.
Computational Infrastructures according to embodiments will now be described. According to embodiments, systems can utilize appliances for processing unstructured data sets in various modes, including batch or streaming. The operations on big unstructured data sets are specific to unstructured data, and can be represented by the transformations typically performed on data sets having such characteristics.
According to embodiments, a computational infrastructure can include a Xockets Software Defined Infrastructure (SDI). A Xockets SDI can minimize porting to the ARM cores of CEs, as well as leverage a common set of transformations across the frameworks that the appliances can support.
According to embodiments, frameworks can run on host processors (x86s) of an appliance. There can be little control plane presence on the XIMM processor (ARM) cores, even in the case the appliance operates as a cluster manager slave. As understood from above, part of the cluster manager slave can run on the unit processor (x86) while only a stripped-down part runs on the XIMM processors (ARMs) (see
If a framework requires more communication with a “Xockets node” (e.g., the Job Tracker communicating with the Task Tracker in Hadoop), such communication can happen on the host processor (x86) between a logical counterpart representing the Xockets node, with the XKD mediating to provide actual communication to XIMM elements.
In such an arrangement, frameworks operating on unstructured data can be implemented as a pipelined graph constructed from transformational building blocks. Such building blocks can be implemented by computations assigned to XIMM processor cores. Accordingly, in some embodiments, the distributed applications running on appliances can perform transformations on data sets. Particular examples of data set transformations can include, but are not limited to: map, reduce, partition by key, combine by key, merge, sort, filter or count. These transformations are understood to be exemplary “canonical” operations (e.g., transformations). XIMM processor cores (and/or any other appliance CEs) can be configured for any suitable transformation.
Thus, within a Xockets node, such transformations can be implemented by XIMM hardware (e.g., ARM processors). Each such operation can take a function/code to implement, such as a map, reduce, combine, sort, etc.
Each of the transformations may take input parameters, such as a string to filter on, a key to combine on, etc. A global framework can be configured by allocating the amount of resources to the XIMM cluster that correlates to the normal amount of cluster resources in the normal cluster, and then assigning roles to different parts of the XIMMs or to entire XIMMs. From this a workflow graph can be constructed, defining inputs and outputs at each point in the graph.
According to embodiments, framework requests for services can be translated into units corresponding to the Xockets architecture. Therefore, a Xockets SDI can implement the following steps: (1) Determine the types of computation being carried out by a framework. This can be reflected in the framework's configuration of a job that it will run on the cluster. This information can result in a framework's request for resources. For example, a job might result in a resource list for N nodes to implement a filter-by-key, K nodes to do a parallel join, as well as M nodes to participate in a merge. These resources are essentially listed out by their transformations, as well as how to hook them together in a work-flow graph. (2) Once this list and the types of transformations are obtained, the SDI can translate this into the resources required to implement on a Xockets cluster. The Xockets SDI can include a correlation between fundamental transformations for a particular framework and XIMM resources. A Xockets SDI can thus map transformations to the XIMM resources needed. At this point any constraints that exist are applied as well (e.g., there might be a need to allocate two computational elements on the same XIMM but in different communication rings for a pipelined computation).
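As one non-limiting sketch of step (2), the SDI could walk a job's transformation list, sum the XIMM resources each transformation maps to, and note placement constraints such as two computational elements on the same XIMM but in different communication rings. The constraint encoding and the lookup helper are assumptions for illustration.

```c
#include <stddef.h>

enum placement { PLACE_ANY, PLACE_SAME_XIMM_DIFF_RINGS };

struct xfm_need {
    unsigned        compute_elems;
    enum placement  constraint;
};

/* Assumed to resolve a framework transformation name to its XIMM needs. */
extern struct xfm_need sdi_lookup(const char *transform);

static unsigned sdi_total_ces(const char *const *transforms, size_t n,
                              unsigned *pipelined_pairs)
{
    unsigned total = 0;

    *pipelined_pairs = 0;
    for (size_t i = 0; i < n; i++) {
        struct xfm_need need = sdi_lookup(transforms[i]);

        total += need.compute_elems;
        if (need.constraint == PLACE_SAME_XIMM_DIFF_RINGS)
            (*pipelined_pairs)++;          /* must be co-located on one XIMM */
    }
    return total;
}
```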
Collected packet data can be reassembled into corresponding complete values (2736, 2738, 2740). Such an action can include packet processing using server resources, including any of those described herein. Based on characteristics of the values (e.g., 2734-0, 2734-1, 2734-2), complete values can be arranged in subsets 2746-0/1.
Transformations can then be made on the subsets as if they were originating from a same network session (2748, 2750). Such action can include utilizing CEs of an appliance as described herein. In particular embodiments, this can include streaming data through CEs of XIMMs deployed in appliances.
Transformed values 2756 can be emitted as packets on other network sessions 2740-x, 2740-y.
In a particular example, when a system is configured for a streaming data processing (e.g., Storm), it can be determined where data sources (e.g., Spouts) are, and how many of them there are. As but one particular example, an input stream can come in from a network through a top of the rack switch (TOR), and a XIMM cluster can be configured with the specified amount of Spouts all running on a host processor (x86). However, if input data is sourced off storage of the XIMMs (e.g., an HDFS file system on the flash), the Spouts can be configured to run on the XIMMs, wherever HDFS blocks are read. Operations (e.g., Bolts) can run functions supplied by the configuration, typically something from the list above. For Bolts, frameworks for a filter bolt or a merge bolt or a counter, etc. can be loaded, and the Spouts can be mapped to the Bolts, and so on. Furthermore, each Bolt can be configured to perform its given operation with predetermined parameters, and then as part of the overall data flow graph, it will be told where to send its output, be it to another computational element on the same XIMM, or a network (e.g., IP) address of another XIMM, etc. For example, a Bolt may need to be implemented that does a merge sort. This may require two pipelined computational elements on a same XIMM, but on different communication rings, as well as a certain amount of RAM (e.g., 512 Mbytes) in which to spill the results. These requirements can be constraints placed on the resource allocation and therefore can be part of the resource list associated with a particular transformation that Storm will use. While the above describes processes with respect to Storm, one skilled in the art would understand different semantics can be used for different processes.
Canonical transformations that are implemented as part of the Xockets computational infrastructure can have an implementation using the Xockets streaming architecture. A streaming architecture can implement transformations on cores (CEs), but in an optimal manner that reduces copies and utilizes HW logic. The HW logic couples inputs and outputs and schedules data flows among or across XIMM processors (ARMs) of the same or adjacent CEs. The streaming infrastructure running on the XIMM processors can have hooks to implement a computational algorithm in such a way that it is integrated into a streaming paradigm. XIMMs can include special registers that accommodate and reflect input from classifiers running in the XIMM processor cores so that modifications to streams as they pass through the computational elements can provide indications to a next phase of processing of the stream.
As noted above, infrastructures according to embodiments can include XIMMs in an appliance.
A computational intensive XIMM 2902-A can have a number of cores (e.g., 24 ARM cores), programmable logic 2905 and a programmable switch 2907. A storage intensive XIMM can include a smaller number of cores (e.g., 12 ARM cores) 2901, programmable logic 2905, a programmable switch 2907, and relatively large amount of storage (e.g., 1.5 Tbytes of flash memory) 2903. Each XIMM 2902-A/B can also include one or more network connections 2909.
A network of XIMMs 3051 can form a XIMM cluster, whether they be computational intensive XIMMs, storage intensive XIMMs, or some combination thereof. The network of XIMMs can occupy one or more rack units. A XIMM cluster can be tightly coupled, unlike conventional data center clusters. XIMMs 3051 can communicate over a DDR memory bus with a hub-and-spoke model, with an XKD (e.g., an x86 based driver) being the hub. Hence, over DDR the XIMMs are all tightly coupled and operate in a synchronous domain over the DDR interconnect. This is in sharp contrast to a loosely-coupled asynchronous cluster.
Also, as understood from above, XIMMs can communicate via network connections (e.g., 2909) in addition to via a memory bus. In particular embodiments, XIMMs 3051 can have a network connection that is connected either to a top of rack (TOR) switch or to other servers in the rack. Such a connection can enable peer-to-peer XIMM-to-XIMM communications that do not require a XKD to facilitate the communication. So, with respect to the network connectors, the XIMMs can be connected to each other or to other servers in a rack. To a node communicating with a XIMM node through the network interface, the XIMM cluster can appear to be a cluster with low and deterministic latencies; i.e., the tight coupling and deterministic HW scheduling within the XIMMs is not typical of an asynchronous distributed system.
According to embodiments, XIMMs can have connections, and be connected to one another for various modes of operation.
As understood, a XIMM can have at least two types of external interfaces, one that connects the XIMMs to a host computer (e.g., CPU) via a memory bus 3204 (referred to as DDR, but not being limited to any particular memory bus) and one or more dedicated network connections 3234 provided on each XIMM 3202. Each XIMM 3202 can support multiple network ports. Disclosed embodiments can include up to two 10 Gbps network ports. Within a XIMM 3202, these interfaces connect directly to the arbiter 3208 which can be conceptualized as an internal switch fabric exposing all the XIMM components to the host through DDR in an internal private network.
A XIMM 3202 can be configured in various ways for computation.
An arbiter 3208 can operate like an internal (virtual) switch, as it can connect multiple types of media, and so can have multi-layer capabilities. According to an embodiment, core capabilities of an arbiter 3208 can include, but are not limited to, switching based on:
1. Proprietary L2 protocols
2. L2 Ethernet (possibly vlan tags)
3. L3 IP headers (for session redirection)
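By way of illustration only, the multi-layer switching decision listed above could be sketched as follows: the arbiter first distinguishes a proprietary L2 framing from standard Ethernet (optionally VLAN tagged), and may then examine the L3/IP header to redirect a specific session to a compute element. The EtherType values for 802.1Q and IPv4 are standard; the proprietary type and the session-table helper are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define ETHERTYPE_VLAN   0x8100u
#define ETHERTYPE_IPV4   0x0800u
#define ETHERTYPE_XKT    0x88B5u   /* placeholder: a proprietary/experimental L2 type */

enum fwd_target { FWD_TO_XKD, FWD_TO_CE, FWD_PROPRIETARY };

/* Assumed lookup: returns a CE index for a redirected session, else -1. */
extern int session_lookup_ipv4(const uint8_t *ip_hdr, size_t len);

static enum fwd_target arbiter_classify(const uint8_t *frame, size_t len)
{
    if (len < 14)
        return FWD_TO_XKD;

    size_t   off   = 12;                                  /* EtherType offset in the frame */
    uint16_t etype = (uint16_t)(frame[off] << 8 | frame[off + 1]);

    if (etype == ETHERTYPE_XKT)                           /* proprietary L2 protocol */
        return FWD_PROPRIETARY;

    if (etype == ETHERTYPE_VLAN && len >= 18) {           /* skip an 802.1Q vlan tag */
        off  += 4;
        etype = (uint16_t)(frame[off] << 8 | frame[off + 1]);
    }

    if (etype == ETHERTYPE_IPV4 && len > off + 2 &&
        session_lookup_ipv4(frame + off + 2, len - off - 2) >= 0)
        return FWD_TO_CE;                                  /* session redirection to a CE */

    return FWD_TO_XKD;                                     /* default: pass through to host */
}
```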
XIMM network interface(s) 3234 can be owned and managed locally on the XIMM by a computing element (CE, such as an ARM processor) (or a processor core of a CE), or alternatively, by an XKD thread on the host responsible for a XIMM. For improved performance, general network/session processing can be limited, with application specific functions prioritized. For those embodiments in which an XKD thread handles the core functionality of the interface, XKD can provide reflection and redirection services through Arbiter programming for specific session/application traffic being handled on the CE's or other XIMMs on the host.
In such embodiments, a base standalone configuration for a XIMM can be equivalent of two network interface cards (nics), represented by two virtual interfaces on the host. In other embodiments, direct server connections such as port bonding on the XIMM can be used.
In some applications, particularly when working with a storage intensive XIMM (e.g.,
In some embodiments, during a XIMM discovery/detection phase, an XKD thread responsible for the XIMM can instantiate a new network driver (virtual interface) that corresponds to the physical port on the XIMM. Additionally, an arbiter's default table can be initially set up to pass all network traffic to the XKD, and similarly forward any traffic from the XKD targeted to the XIMM network port to it, as disclosed for embodiments herein.
Interfaces for XIMMs will now be described with reference to
Referring to
Arbiter 3208 is in effect operating as an internal (virtual) switch. Since the arbiter connects multiple types of media, it has multi-layer capabilities. Core capabilities include but are not limited to switching based on: Proprietary L2 protocols; L2 Ethernet (possibly vlan tags); and L3 IP headers (for session redirection).
XIMM network interface(s) (3218/3234) can be owned and managed locally on the XIMM by a computing element (CE, such as an ARM processor) (or a processor core of a CE), or alternatively, by a driver (referred to herein as XKD) thread on the host responsible for that XIMM. For improved performance, general network/session processing can be limited, with application specific functions prioritized. For those embodiments in which an XKD thread handles the core functionality of the interface, XKD can provide reflection and redirection services through arbiter 3208 programming for specific session/application traffic being handled on the CE's or other XIMMs on the host.
In this model, the base standalone configuration for a XIMM 3202 can be equivalent of two network interface cards (nics), represented by two virtual interfaces on the host. In other embodiments, direct server connections such as port bonding on the XIMM can be used.
XIMMs 3202 can take various forms including a Compute XIMM and a Storage XIMM. A Compute XIMM can have a number of cores (e.g., 24 ARM cores), programmable logic and a programmable switch. A Storage XIMM can include a smaller number of cores (e.g., 12 ARM cores), programmable logic, a programmable switch, and relatively large amount of storage (e.g., 1.5 Tbytes of flash memory).
In some applications, particularly when working with a Storage XIMM, an arbiter 3208 can act as an L2 switch and have up to every CE in the XIMM own its own network interface.
As shown in
This ensures that the host stack will have all access to this interface and all the capabilities of the host stack are available.
These XIMM interfaces can be instantiated in various modes depending on a XIMM configuration, including but not limited to: (1) a host mode; (2) a compute element/storage element (CE/SE) mode (internal and/or external); (3) a server extension mode (including as a proxy across the appliance, as well as internal connectivity).
The modes in which the interfaces are initialized have a strong correlation to the network demarcation point for that interface. Table 15 shows network demarcation for the modes noted above.
A XIMM assisted appliance (appliance) can be connected to the external world depending on the framework(s) being supported. For many distributed applications the appliance sits below the top of rack (TOR) switch with connectivity to both the TOR switch and directly attached to servers on the rack. In other deployments, as in the case of support of distributed storage or file systems the appliance can be deployed with full TOR connectivity serving data directly from SE devices in the XIMMs.
Even though the appliance functions in part as a networking device (router/switch) given its rich network connectivity, for particular Big Data appliance applications, it can always terminate traffic. Typically, such an appliance doesn't route or switch traffic between devices nor does it participate in routing protocols or spanning tree. However, certain embodiments can function as a downstream server by proxying the server's interface credentials across the appliance.
In host mode the XIMM can act like a NIC for the appliance. As shown in
In this case the host can configure this interface as any other network interface, with the host stack handling any of ARP, DHCP, etc.
Host mode can contribute to the management of the interfaces, general stack support and handling of unknown traffic.
Host Mode with Redirection (Split Stack)
The CE will be running an IP stack in user space (e.g., lwIP) to facilitate packet processing for the specific session being redirected.
Traffic that is not explicitly redirected to a CE through arbiter programming (i.e., 3556) can pass through to XKD as in Host Mode (and as 3454). Conversely, any session redirected to a CE is typically the responsibility of the CE, so that the XKD will not see any traffic for it.
As shown in
As shown in
In the common environment of many TOR connections, providing the appliance 3700 with a single identity (IP address) towards the TOR network is useful. Link bonding on the appliance 3700 and some load sharing/balancing capabilities can be used particularly for stateless applications. For streaming applications, pinned flows to a specific XIMM are improved if certain traffic is directed to specific XIMM ports in order to maintain session integrity and processing. Such flows can be directed to a desired XIMM (3202-0, 3202-1) by giving each XIMM network port (3234-0/1) a unique IP address. Though this requires more management overhead, it does provide an advantage of complete decoupling from the existing network infrastructure. That identity could be a unique identity or a proxy for a directly connected server.
Another aspect to consider is the ease of integration and deployment of the appliance 3700 onto existing racks. Connecting each server (3754-0/1) to the appliance 3700 and accessing that server port through the appliance (without integrating with the switching or routing domains) can involve extension or masquerade of the server port across the appliance.
In one embodiment, an appliance configured for efficient operation of Hadoop or other map/reduce data processing operations can connect to all the servers on the rack, with any remaining ports connecting to the TOR switch. Connection options can range from a 1 to 1 mapping of server ports to TOR ports, to embodiments with a few to 1 mapping of server ports to TOR ports.
In this case, the network interface instance of the TOR XIMM 3202-N can support proxy-ARP (address resolution protocol) for the servers it is masquerading for. Configuration on the appliance 3700 can include (1) mapping server XIMMs (3202-0/1) to a TOR XIMM (3202-N); (2) providing server addressing information to TOR XIMM 3202-N; (3) configuring TOR XIMM 3202-N interface to proxy-ARP for server(s) address(es); (4) establishing any session redirection that the TOR XIMM will terminate; and (5) establishing a pass-through path from TOR 3719 to each server XIMM (3202-0/1) for non-redirected network traffic.
Referring still to
Referring to
Approaches to this implementation can vary, depending on whether streaming mode is supported on the XIMM 3202 or each CE/SE 3866 is arranged to operate autonomously. In the latter case, as shown in
Alternatively, operation in a streaming mode can be enabled by extending the Host model previously described with a split stack functionality. In this case, for each CE/SE 3866 an interface on the host is instantiated to handle the main network stack functionality. Only sessions specifically configured for processing on the CE/SE would be redirected to them and programmed on the arbiter.
Additionally, referring to
As shown in
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It is also understood that the embodiments of the invention may be practiced in the absence of an element and/or step not specifically disclosed. That is, an inventive feature of the invention may be elimination of an element.
Accordingly, while the various aspects of the particular embodiments set forth herein have been described in detail, the present invention could be subject to various changes, substitutions, and alterations without departing from the spirit and scope of the invention.
This application is a continuation of Patent Cooperation Treaty (PCT) Application No. PCT/US2015/023730 filed Mar. 31, 2015 which claims the benefit of U.S. Provisional Patent Application No. 61/973,205 filed Mar. 31, 2014 and a continuation of PCT Application No. PCT/US2015/023746 which claims the benefit of U.S. Provisional Patent Applications No. 61/973,207 filed Mar. 31, 2014 and No. 61/976,471 filed Apr. 7, 2014, the contents all of which are incorporated by reference herein.
Provisional Applications:

| Number | Date | Country |
| --- | --- | --- |
| 61/973,205 | Mar 2014 | US |
| 61/973,207 | Mar 2014 | US |
| 61/976,471 | Apr 2014 | US |

Parent/Child Applications:

| Relation | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | PCT/US15/23730 | Mar 2015 | US |
| Child | 15/283,287 | | US |
| Parent | PCT/US15/23746 | Mar 2015 | US |
| Child | PCT/US15/23730 | | US |