The memory and computation resources of today's data centers are struggling to keep up with growing data and bandwidth needs, such as those of big data and machine learning applications. Although caching techniques such as memory access prediction and data prefetch have been used in a single device with a Central Processing Unit (CPU) and main memory, such techniques have not been developed for distributed caches where cache lines would be accessed by different processing nodes from one or more memory nodes over an Ethernet fabric. Conventional network latencies in transferring data between processing nodes and memory nodes have generally limited the use of such distributed caches.
However, the emergence of high-performance networking (e.g., 100 Gb/s per link and 6.4 Tbit/s aggregate throughput) using Software Defined Networking (SDN) means that the network may no longer be the performance bottleneck in implementing a distributed cache on a network. In this regard, the data transfer latency of conventional fixed-function networking, as opposed to more recent SDN, can be three orders of magnitude greater than typical memory device data access latencies. For example, data transfer latencies with conventional fixed-function networking are typically on the order of hundreds of microseconds, as compared to data access latencies on the order of hundreds of nanoseconds for memory devices such as Dynamic Random Access Memory (DRAM) or Static Random Access Memory (SRAM).
Although newer high-performance networking may provide an acceptable fabric for a distributed cache, challenges remain in maintaining the coherency of copies of data at different processing and memory nodes in the distributed cache system. In addition, there remain problems with interoperability of different types of processing and memory nodes and fault tolerance in a network fabric, such as an Ethernet fabric, where hardware or link failures can cause system unavailability, packet drop, reordering, or duplication, as compared to the ordered and reliable interconnect or bus communication for a conventional cache used by a CPU and a main memory in a single device.
The features and advantages of the embodiments of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings. The drawings and the associated descriptions are provided to illustrate embodiments of the disclosure and not to limit the scope of what is claimed.
In the following detailed description, numerous specific details are set forth to provide a full understanding of the present disclosure. It will be apparent, however, to one of ordinary skill in the art that the various embodiments disclosed may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail to avoid unnecessarily obscuring the various embodiments.
Network 112 can include, for example, a Storage Area Network (SAN), a Local Area Network (LAN), and/or a Wide Area Network (WAN), such as the Internet. In this regard, one or more of clients 114, SDN controller 102, and/or one or more of server racks 101 may not be physically co-located. Server racks 101, SDN controller 102, and clients 114 may communicate using one or more standards such as, for example, Ethernet, Fibre Channel, and/or InfiniBand.
As shown in the example of
Software Defined Networking (SDN) controller 102 communicates with each of the programmable switches 104 in system 100. As discussed in more detail below, SDN controller 102 can ensure that a global cache directory maintained at SDN controller 102 and local cache directories maintained at programmable switches 104 (e.g., cache directories 12A, 12B, 12C, and 12D) are consistent. Those of ordinary skill in the art will appreciate that other implementations may include a different number or arrangement of memory devices 110, programmable switches 104, or server racks 101 than shown in the example of
Programmable switches 104A, 104B, 104C, and 104D route memory messages, such as put requests, get requests, and other communications between clients 114 and memory devices 110. For example, such memory messages may include a get request for a specific memory address or a permission level request for a client to modify a cache line requested from a memory device. As discussed in more detail below with reference to the examples of
In some implementations, programmable switches 104 can include, for example, a switch that can be programmed to handle different custom protocols. As discussed in more detail below with reference to
Data planes 106 of programmable switches 104 in the example of
Data planes 106 of programmable switches 104 are programmable and separate from higher-level control planes 108 that determine end-to-end routes for packets between devices in system 100. In this regard, control planes 108 may be used for handling different processes, such as some of the processes in
In one example, programmable switches 104 can be 64 port Top of Rack (ToR) P4 programmable switches, such as a Barefoot Networks Tofino Application Specific Integrated Circuit (ASIC) with ports configured to provide 40 Gigabit Ethernet (GE) frame rates. Other programmable switches that can be used include, for example, a Cavium Xpliant programmable switch or a Broadcom Trident 3 programmable switch.
The use of a programmable switch allows for the configuration of high-performance and scalable memory centric architectures by defining customized packet formats and processing behavior, such as those discussed below with reference to
SDN controller 102 provides global cache coherency monitoring and control among programmable switches 104 in managing the distributed cache stored in memory devices 110. Each programmable switch 104 can provide centralized data coherency management for the data stored in the memory devices of its respective server rack 101. As discussed in more detail below, each programmable switch 104 can efficiently update a local cache directory 12 for memory devices 110 that it communicates with as cache line requests are received by the programmable switch 104. Limiting cache directory 12 to the memory devices 110 that communicate with the programmable switch 104 can also improve the scalability of the distributed cache, or the ability to expand the size of the distributed cache to new memory devices, such as by adding a new server rack with its own programmable switches and memory devices.
In some implementations, memory devices 110 can include, for example, Storage Class Memories (SCMs) or other types of memory, such as Dynamic Random Access Memory (DRAM) or Static RAM (SRAM), that can store and retrieve data at a byte-addressable size or cache line size, as opposed to a page or block size as in storage devices such as Solid-State Drives (SSDs) or Hard Disk Drives (HDDs). SCMs can include, for example, Chalcogenide RAM (C-RAM), Phase Change Memory (PCM), Programmable Metallization Cell RAM (PMC-RAM or PMCm), Ovonic Unified Memory (OUM), Resistive RAM (RRAM), Ferroelectric Memory (FeRAM), Magnetoresistive RAM (MRAM), 3D-XPoint memory, and/or solid-state memory, such as non-volatile NAND memory. Recently developed SCMs can provide non-volatile storage with a fine granularity of access (i.e., byte-addressable or cache line level) and a shorter data access latency, as compared to storage devices, such as an SSD using conventional flash memory or an HDD using a rotating magnetic disk.
As will be appreciated by those of ordinary skill in the art, system 100 may include additional devices or a different number of devices than shown in the example of
Processor 116A can execute instructions, such as instructions from distributed cache module 16A, and application(s) 18A, which may include an Operating System (OS) and/or other applications used by client 114A. Processor 116A can include circuitry such as a Central Processing Unit (CPU), Graphics Processing Unit (GPU), microcontroller, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), hard-wired logic, analog circuitry and/or a combination thereof. In some implementations, processor 116A can include a System on a Chip (SoC), which may be combined with one or both of memory 118A and interface 122A. Processor 116A can include one or more cache levels (e.g., L1, L2, and/or L3 caches) where data is loaded from or flushed into memory 118A, or loaded from or flushed into memory devices 110, such as memory device 110₁ in
Memory 118A can include, for example, a volatile RAM such as SRAM, DRAM, a non-volatile RAM, or other solid-state memory that is used by processor 116A as an internal main memory to store data. Data stored in memory 118A can include data read from storage device 120A, data to be stored in storage device 120A, instructions loaded from distributed cache module 16A or application(s) 18A for execution by processor 116A, and/or data used in executing such applications. In addition to loading data from internal main memory 118A, processor 116A also loads data from memory devices 110 as an external main memory or distributed cache. Such data may also be flushed after modification by processor 116A or evicted without modification back into internal main memory 118A or an external main memory device 110 via programmable switch 104A or programmable switch 104B.
As shown in
Storage device 120A serves as secondary storage that can include, for example, one or more rotating magnetic disks or non-volatile solid-state memory, such as flash memory. While the description herein refers to solid-state memory generally, it is understood that solid-state memory may comprise one or more of various types of memory devices such as flash integrated circuits, NAND memory (e.g., single-level cell (SLC) memory, multi-level cell (MLC) memory (i.e., two or more levels), or any combination thereof), NOR memory, EEPROM, other discrete Non-Volatile Memory (NVM) chips, or any combination thereof. As noted above, internal main memory 118A and external memory devices 110 typically provide faster data access and can provide more granular data access (e.g., cache line size or byte-addressable) than storage device 120A.
Interface 122A is configured to interface client 114A with devices in system 100, such as programmable switches 104A and 104B. Interface 122A may communicate using a standard such as, for example, Ethernet, Fibre Channel, or InfiniBand. In this regard, client 114A, programmable switches 104A and 104B, SDN controller 102, and memory device 110₁ may not be physically co-located and may communicate over a network such as a LAN or a WAN. As will be appreciated by those of ordinary skill in the art, interface 122A can be included as part of processor 116A.
Programmable switches 104A and 104B in some implementations can be ToR switches for server rack 101₁ including memory device 110₁. In the example of
Memory 134 of a programmable switch 104 can include, for example, a volatile RAM such as DRAM, or a non-volatile RAM or other solid-state memory such as register arrays that are used by circuitry 132 to execute instructions loaded from switch cache module 26 or firmware of the programmable switch 104, and/or data used in executing such instructions, such as cache directory 12. In this regard, and as discussed in more detail below, switch cache module 26 can include instructions for implementing processes such as those discussed with reference to
In the example of
SDN controller 102 in the example of
Processor 124 of SDN controller 102 executes cache controller module 22 to maintain global cache directory 20 and update local cache directories 12 at programmable switches 104, as needed. Processor 124 can include circuitry such as a CPU, GPU, microcontroller, a DSP, an ASIC, an FPGA, hard-wired logic, analog circuitry and/or a combination thereof. In some implementations, processor 124 can include an SoC, which may be combined with one or both of memory 126 and interface 128. Memory 126 can include, for example, a volatile RAM such as DRAM, a non-volatile RAM, or other solid-state memory that is used by processor 124 to store data. SDN controller 102 communicates with programmable switches 104 via interface 128, which is configured to interface with ports of programmable switches 104A and 104B, and may interface according to a standard, such as Ethernet, Fibre Channel, or InfiniBand.
As will be appreciated by those of ordinary skill in the art, other implementations may include a different arrangement or number of components or modules than shown in the example of
In the example of
As noted above, memory messages can have a custom packet format so that programmable switch 104A can distinguish memory messages, such as messages for cache line addressed data, from other network traffic, such as messages for page addressed data. The indication of a memory message, such as a cache line request to put or get cache line data, causes circuitry 132A of programmable switch 104A to handle the packet differently from other packets that are not indicated as being a memory message. In some implementations, the custom packet format fits into a standard 802.3 Layer 1 frame format, which can allow the packets to operate with existing and forthcoming programmable switches, such as a Barefoot Tofino ASIC switch, for example. In such an implementation, the preamble, start frame delimiter, and interpacket gap may follow the standard 802.3 Layer 1 frame format, but portions in Layer 2 are replaced with custom header fields that can be parsed by programmable switch 104A. A payload of a packet for a memory message can include one or more memory addresses for one or more cache lines being requested by a client or being returned to a client, and may include data for the cache line or lines.
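The following is a simplified sketch, in Python, of how an ingress parser might distinguish such memory messages from other network traffic before extracting the custom header fields. The EtherType value and the 14-byte header layout assumed below are illustrative choices and are not specified by the present disclosure.

```python
import struct

# Hypothetical EtherType used to mark cache line memory messages; the
# disclosure replaces portions of Layer 2 with custom fields but does not
# specify a value, so 0x88B5 (a local experimental EtherType) is assumed.
MEMORY_MESSAGE_ETHERTYPE = 0x88B5

def classify_frame(frame: bytes) -> str:
    """Classify a raw frame as a cache line memory message or other traffic.

    Assumes a conventional 14-byte Ethernet header layout (destination MAC,
    source MAC, 2-byte type field) ahead of the custom header fields.
    """
    if len(frame) < 14:
        return "runt frame"
    (frame_type,) = struct.unpack_from("!H", frame, 12)
    if frame_type == MEMORY_MESSAGE_ETHERTYPE:
        return "cache line memory message"
    return "other network traffic"
```

A parser stage of the ingress pipeline could branch on such a type field and pass only memory messages to the match-action stages discussed below.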
Stages 36₂ and 36₃ can include, for example, programmable Arithmetic Logic Units (ALUs) and one or more memories that store match-action tables for matching extracted values from the headers and performing different corresponding actions based on the values, such as performing particular updates to cache directory 12A stored in memory 134A of programmable switch 104A. In some implementations, the stages of the ingress pipeline and the egress pipeline may share a single memory, such as memory 134A in
Traffic manager 38 routes the cache line request to an appropriate port of programmable switch 104A. As discussed in more detail in co-pending application Ser. No. 16/548,116, entitled “DISTRIBUTED CACHE WITH IN-NETWORK PREFETCH”, filed on Aug. 22, 2019, and incorporated by reference above, the ingress pipeline in some implementations may calculate offsets for additional cache lines to be prefetched based on the parsed header fields, and then generate corresponding additional cache line request packets using a packet generation engine of programmable switch 104A.
In the example of
As will be appreciated by those of ordinary skill in the art, other implementations may include a different arrangement of modules for a programmable switch. For example, other implementations may include more or fewer stages as part of the ingress or egress pipeline.
If the incoming message is a cache line memory message, such as a get or a put cache line request to retrieve or store a cache line, respectively, ingress pipeline 36 determines whether the cache line memory message is a read request or a write request. As discussed in the example header format of
If the incoming message is not a cache line memory message, such as a read or write command in units greater than a cache line size (e.g., in a page or block size), the message or portions of the message, such as a header and a payload, are passed to traffic manager 38, which can determine a port for sending the message. In some implementations, a destination address in the header can indicate a port to send the message via egress pipeline 40, which may reassemble the message before sending the message to another device in system 100.
In the case where the incoming message is a cache line memory message, match-action tables of one or more of stages 36₂ and 36₃ may be used to determine a memory device 110 storing the requested cache line or cache lines. In this regard, the memory device 110 may serve as a home node or serialization point for the cache lines it stores by allowing access and granting permission levels for modification of the cache lines to other nodes or devices in system 100. Traffic manager 38 can determine a port for sending the cache line request to the identified memory device 110 storing the requested cache line.
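As one illustration, this lookup can be sketched as a small table keyed on a portion of the cache line address that yields the home memory device 110 and the corresponding egress port. The 8-bit prefix match and the table contents below are assumptions used only for illustration.

```python
# Match-action style lookup mapping a cache line address to its home
# memory device 110 and the egress port facing that device. The use of an
# 8-bit address prefix as the match key is an assumed convention.
HOME_NODE_TABLE = {
    0x00: ("memory device 110_1", 7),
    0x01: ("memory device 110_2", 8),
}

def lookup_home_node(cache_line_address: int):
    """Return (home memory device, egress port) for a cache line address."""
    prefix = (cache_line_address >> 24) & 0xFF
    if prefix not in HOME_NODE_TABLE:
        raise LookupError(f"no home node entry for address {cache_line_address:#x}")
    return HOME_NODE_TABLE[prefix]
```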
In the cases of a read miss or a write miss, egress pipeline 40 including deparser 40₃ reassembles or builds one or more packets for the cache line request and sends them to the identified memory device 110. In the cases of a read hit or a write hit, one or more of egress stages 40₁ and 40₂ may be used to update cache directory 12. In some examples, a status or permission level, and/or a version number may be changed in cache directory 12 for an entry corresponding to the requested cache line. The read request may be reassembled or built by deparser 40₃, and sent to the identified memory device 110 storing the requested data.
In the example of a write request, egress pipeline 40 may use one or more of egress stages 40₁ and 40₂ to identify other nodes or devices in system 100 storing a copy of the requested cache line or lines and a status or permission level for the requested data. In such examples, egress pipeline 40 may also send cache line requests to the other nodes or devices to change a status or permission level of such other nodes. For example, a request to modify a cache line that is being shared by multiple nodes in addition to the memory device 110 storing the cache line can result in egress pipeline 40 sending cache line requests to the other nodes to change their permission level from shared to invalid for the cache line requested from memory device 110.
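A minimal sketch of this invalidation step follows; the directory entry layout, the probe representation, and the OpCode name are assumptions used to illustrate the behavior described above.

```python
def build_invalidation_probes(directory_entry: dict, requesting_client: str) -> list:
    """Build probe messages downgrading other sharers of a cache line to invalid.

    directory_entry is assumed to hold the cache line 'address' and a list of
    'sharers'; each probe is represented here as a simple dictionary.
    """
    probes = []
    for node in directory_entry["sharers"]:
        if node == requesting_client:
            continue  # the requester keeps its copy and is upgraded separately
        probes.append({
            "target": node,
            "opcode": "probe_invalidate",  # assumed name for the probe OpCode
            "address": directory_entry["address"],
        })
    return probes
```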
As will be appreciated by those of ordinary skill in the art, other arrangements of operations performed by programmable switch 104 are possible than those shown in the example of
As shown in
In some cases, an address or other indicator of the memory device 110 storing the cache line may be included as part of the address for the cache line. As shown in the example of
In this regard, different devices in a system implementing a distributed cache may not be exactly synchronized with each other. In some implementations, this challenge is overcome by using the time provided by the home memory device 110 that stores the requested data. Programmable switch 104 may receive this time in a cache line memory message from memory device 110 with the requested data. The use of the home memory device 110 that stores the requested data as the serialization point or timekeeper for the requested data can provide a consistent timestamp for the requested data and allow for scalability of the distributed cache without having to synchronize timekeeping among an increasing number of devices at a central location.
In the example of cache directory 12A in
The cache line indicated by address C in cache directory 12A is stored in memory device 110₂, and has shared read-only copies of the cache line stored at clients 114A and 114B. The cache line has been modified twice since it was originally stored in memory device 110₂, and was last modified or authorized to be modified by its home memory device 110₂ at the time indicated by the corresponding timestamp in cache directory 12A.
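The cache directory entries described above can be sketched as the following data structure, where the field types and the example values mirroring the address C discussion are assumptions rather than values taken from the present disclosure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PermissionLevel(Enum):
    SHARED = "S"     # read-only copies may exist at multiple nodes
    EXCLUSIVE = "E"  # a single node may modify the cache line
    INVALID = "I"    # the node's copy may no longer be used

@dataclass
class CacheDirectoryEntry:
    address: int                            # cache line address
    home_device: str                        # memory device 110 acting as home node
    copy_holders: List[str] = field(default_factory=list)  # other nodes storing copies
    status: PermissionLevel = PermissionLevel.SHARED
    version: int = 0                        # incremented on each modification
    timestamp: int = 0                      # time provided by the home memory device

# Example entry resembling the address C discussion above (values assumed).
entry_c = CacheDirectoryEntry(
    address=0xC0,
    home_device="memory device 110_2",
    copy_holders=["client 114A", "client 114B"],
    status=PermissionLevel.SHARED,
    version=2,
    timestamp=1024,
)
```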
As shown in
As will be appreciated by those of ordinary skill in the art, cache directory 12A may include different information than that shown in
In
For its part, memory device 110₁ receives the cache line request from client 114A and either maintains a shared permission level (i.e., S in memory device 110₁) with respect to the requested data or changes its permission level with respect to the requested data from exclusive to shared (i.e., E to S in
In the bottom half of
As noted above, the present disclosure uses programmable switch 104 to maintain the cache directory for its respective memory devices 110. This ordinarily provides an efficient way to maintain cache directories 12 for the distributed cache, since programmable switch 104 serves as an intermediary or centralized location for communication between clients 114 and its memory devices 110. Programmable switch 104 can update its cache directory 12 based on the cache line requests it receives for memory devices 110 without having to coordinate among a larger number of caches located at a greater number of clients 114 or memory devices 110. Using programmable switch 104 to update a local cache directory also improves scalability of the distributed cache, since, in certain implementations, each programmable switch is responsible for only the cache lines stored in its associated set of memory devices 110.
The top right example state diagram of
The bottom example state diagram in
Memory device 110₁ then sends the requested data to client 114A and grants permission to client 114A to modify the data. The status of memory device 110₁ with respect to the requested data changes from shared to invalid, while the status of client 114A with respect to the requested data changes from either invalid to exclusive or shared to exclusive, depending on whether client 114A was previously sharing the data with clients 114B and 114C. In cases where client 114A already was sharing the requested data, memory device 110₁ may only send an indication that the permission level of client 114A can be changed from shared to exclusive, since client 114A already has a copy of the requested data.
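The permission level changes described in these state diagrams can be summarized by the following sketch, which models only the shared, exclusive, and invalid levels mentioned above; the dictionary-based bookkeeping and node labels are illustrative assumptions.

```python
def grant_modify(permissions: dict, requester: str, home_device: str) -> dict:
    """Return updated per-node permission levels after a modify request is granted."""
    updated = dict(permissions)
    for node, level in permissions.items():
        if node != requester and level == "shared":
            updated[node] = "invalid"   # other sharers are invalidated
    updated[home_device] = "invalid"    # the home memory device gives up its shared status
    updated[requester] = "exclusive"    # the requester may now modify the cache line
    return updated

before = {"client 114A": "shared", "client 114B": "shared",
          "client 114C": "shared", "memory device 110_1": "shared"}
after = grant_modify(before, "client 114A", "memory device 110_1")
# after: client 114A is exclusive; the other clients and the home device are invalid
```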
In the example state diagram on the right side of
As discussed above, memory device 110 in the foregoing examples serves as a serialization point for the modification of the data it stores. In other words, requests for the same data are typically performed in the order in which memory device 110 receives them. In addition, memory device 110 uses a non-blocking approach where cache line requests are granted in the order in which they are received. In some implementations, programmable switch 104 may delay additional requests received for data that is in progress of being modified and/or may send a request for a modified copy of the cache line to the client that has modified the data without having to wait for a request from memory device 110 to retrieve the modified data from the client. These features are discussed in more detail below with reference to the cache line request conflict process of
In the example of
The OpCode field can indicate an operation type for an intended operation to be performed using a requested cache line or cache lines such as acquire to read or acquire to read and write. In other cases, the OpCode field can indicate whether the packet is a probe to change the permission level of a client or a probe acknowledgment to indicate that a permission level has been changed, as discussed above with reference to the example state diagrams of
The size field of header 30 can indicate the size of the data requested (e.g., a number of cache lines or a size in bytes) or the size of the data provided in payload 32. The domain field can provide coherence message ordering guarantees within a subset of messages, and the source field can indicate a source identifier or other identifier for the device that issued the request.
Payload 32 can include, for example, an address or addresses for one or more cache lines that are requested from a programmable switch 104 or may include data for one or more cache lines being returned to a client 114 from a programmable switch 104. In the example of
As will be appreciated by those of ordinary skill in the art, other message formats can be used with programmable switches 104 to perform cache line requests and update cache directories 12.
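For illustration, the header fields named above can be packed and parsed as in the following sketch. The field widths, byte order, and OpCode numbering are assumptions, since the present disclosure names the fields but does not fix their encodings.

```python
import struct

HEADER_FORMAT = "!BHHI"   # OpCode:1 byte, size:2, domain:2, source:4 (assumed widths)
OPCODES = {"acquire_read": 0x01, "acquire_read_write": 0x02,
           "probe": 0x03, "probe_ack": 0x04}

def pack_message(opcode: str, domain: int, source: int, addresses: list) -> bytes:
    """Pack a cache line memory message with a payload of 64-bit addresses."""
    payload = b"".join(struct.pack("!Q", address) for address in addresses)
    header = struct.pack(HEADER_FORMAT, OPCODES[opcode], len(payload), domain, source)
    return header + payload

def unpack_message(packet: bytes):
    """Return (opcode, domain, source, addresses) parsed from a packed message."""
    opcode, size, domain, source = struct.unpack_from(HEADER_FORMAT, packet, 0)
    offset = struct.calcsize(HEADER_FORMAT)
    body = packet[offset:offset + size]
    addresses = [struct.unpack_from("!Q", body, i)[0] for i in range(0, len(body), 8)]
    return opcode, domain, source, addresses
```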
In block 702, programmable switch 104 receives a cache line request from a client 114 on network 112 to obtain one or more cache lines for performing an operation by the client. As discussed above, the cache line is a size of data that can be used by a processor of the requesting client that would otherwise be accessed from a local main memory of the client in a conventional system. In some implementations, programmable switch 104 may also perform a prefetch process to expand the requested cache line or cache lines to include additional, unrequested cache lines that are predicted to be needed by the requesting client based at least in part on the cache line or lines being requested by the client. Examples of such prefetch processes are provided in co-pending application Ser. No. 16/548,116, entitled “DISTRIBUTED CACHE WITH IN-NETWORK PREFETCH”, filed on Aug. 22, 2019, and incorporated by reference above.
In block 704, programmable switch 104 identifies a port of the plurality of ports 130 of programmable switch 104 for communicating with a memory device 110 storing the requested cache line or cache lines. As discussed above with reference to
In block 706, programmable switch 104 updates its local cache directory 12 for the distributed cache based on the received cache line request. For example, an egress stage of egress pipeline 40 (e.g., stage 40₁ or 40₂ in
Although the example discussed above with reference to
In block 708, a cache line request is sent to memory device 110 using the port identified in block 704. As discussed above with the example of
In addition, programmable switch 104 may wait to receive a confirmation or return message from memory device 110 indicating that the cache line request has been received by memory device 110. After a timeout period, programmable switch 104 may resend the cache line request, which can improve fault tolerance for dropped packets.
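The sequence of blocks 702 through 708, including retransmission after a timeout, can be summarized by the following simplified, synchronous sketch. The prefix-based port lookup, the directory bookkeeping, and the send and acknowledgment callbacks are assumptions standing in for the match-action stages and ports described above.

```python
def handle_cache_line_request(request: dict, directory: dict, port_table: dict,
                              send_fn, recv_ack_fn,
                              timeout_s: float = 0.001, max_retries: int = 3) -> bool:
    """Process one cache line request at the switch (blocks 702-708, simplified)."""
    # Block 704: identify the port for the memory device storing the cache line.
    port = port_table[request["address"] >> 24]        # assumed prefix-based mapping
    # Block 706: update the local cache directory based on the request.
    directory[request["address"]] = {
        "requester": request["source"],
        "for_modify": request.get("modify", False),
    }
    # Block 708: send the request, resending after a timeout to tolerate drops.
    for _ in range(max_retries):
        send_fn(port, request)
        if recv_ack_fn(port, timeout_s):
            return True
    return False  # caller may report an error or fall back to another path
```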
As discussed below with reference to
In block 804, programmable switch 104 determines whether the cache line request is for modifying the one or more cache lines. In some implementations, an ingress stage of programmable switch 104 may compare an OpCode field in the header to a particular OpCode indicating a request to modify the requested data. The determination of whether the cache line request is to modify the data may affect how programmable switch 104 updates cache directory 12 and/or if programmable switch 104 should perform other operations to manage conflicting requests to modify the same data.
In block 806, programmable switch 104 updates its local cache directory 12 based on the cache line request received in block 802. In some implementations, one or more egress stages may perform the update. For example, an egress stage of egress pipeline 40 (e.g., stage 40₁ or 40₂ in
In block 808, programmable switch 104 sends the reassembled cache line request to the memory device 110 serving as a home node that stores the requested data. A traffic manager of programmable switch 104 may identify a port for the memory device 110 and a deparser of programmable switch 104 may reassemble a previously extracted header and payload to form the cache line request to be sent to memory device 110. In some implementations, the ingress pipeline and/or egress pipeline of programmable switch 104 may perform additional operations, such as error checking, timestamping the cache line request, and/or identifying additional cache lines to request as part of a prefetch process.
In addition, programmable switch 104 may wait to receive a confirmation or return message from memory device 110 indicating that the cache line request has been received by memory device 110. After a timeout period, programmable switch 104 may resend the cache line request, which can improve fault tolerance for dropped packets.
In block 810, programmable switch 104 optionally sends one or more indications of updates to its local cache directory 12 to another programmable switch 104 and/or SDN controller 102 to update mirrored cache directories. Such updates can include, for example, a new version number, other nodes that may store a copy of the data, and/or timestamps for when the data was modified or authorized to be modified.
As discussed above, the mirroring of cache directory 12 at another programmable switch 104 and/or SDN controller 102 can improve the fault tolerance or redundancy of the distributed cache. If the cache directory becomes unavailable at programmable switch 104, such as due to a power loss or removal of programmable switch 104 from its server rack 101, the other programmable switch for the rack or SDN controller 102 can be used. As with the sending of a cache line request to a memory device 110, a packet retransmission may also be used when sending indications of updates to cache directory 12 to other programmable switches and/or SDN controller 102.
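Block 810 can be sketched as follows, with the update message carrying the items listed above (version number, copy holders, and timestamp); the message shape and the transport callback are assumptions.

```python
def build_directory_update(address: int, entry: dict) -> dict:
    """Summarize a cache directory change for a mirrored directory."""
    return {
        "address": address,
        "version": entry["version"],
        "copy_holders": list(entry["copy_holders"]),
        "timestamp": entry["timestamp"],
    }

def mirror_directory_update(address: int, entry: dict, mirrors: list, send_fn) -> None:
    """Send the same update to each mirror (e.g., the rack's other switch, SDN controller 102)."""
    update = build_directory_update(address, entry)
    for mirror in mirrors:
        send_fn(mirror, update)   # retransmission on timeout handled as discussed above
```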
In block 902, programmable switch 104 receives an additional cache line request from a client 114 that is different from another client 114 that sent a previous cache line request to obtain one or more cache lines. In block 904, programmable switch 104 checks the status of the one or more cache lines being requested by the additional cache line request to determine if the requested cache line or lines are in progress of being modified. An ingress or egress pipeline may check cache directory 12 to determine if a status in the cache directory 12 for one or more entries corresponding to the cache line or lines indicate that the data is being modified.
If it is determined that the cache line or lines are in the process of being modified, programmable switch 104 in block 906 sends a new cache line request for the modified version of the cache line or lines to the previous client 114 to return the modified cache line or lines to memory device 110. In such cases, time and the use of processing resources of memory device 110 can be conserved by not having memory device 110 prepare and send the new request for the modified data back to programmable switch 104.
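A minimal sketch of this conflict check follows, with the non-conflicting case of block 908 (discussed next) included as the fall-through path; the directory field names and the write-back OpCode are assumptions.

```python
def handle_conflicting_request(request: dict, directory: dict,
                               home_device: str, send_fn) -> str:
    """Check for an in-progress modification before forwarding a cache line request."""
    entry = directory.get(request["address"], {})
    if entry.get("status") == "modification_in_progress":
        # Block 906: ask the earlier client to return the modified cache line to
        # the memory device, without waiting for the memory device to request it.
        send_fn(entry["modifying_client"],
                {"opcode": "writeback_request", "address": request["address"]})
        return "deferred"
    # Block 908: no conflict detected, forward the request to the home memory device.
    send_fn(home_device, request)
    return "forwarded"
```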
If it is instead determined in block 904 that the cache line or lines are not in the process of being modified, programmable switch 104 in block 908 sends the additional cache line request to memory device 110 to obtain the requested data. As discussed above with reference to the example state diagrams of
As will be appreciated by those of ordinary skill in the art, other embodiments of programmable switch 104 may not send a new cache line request for the modified data as in the cache line request conflict process of
In block 1004, programmable switch 104 sends version numbers for cache lines that have been modified since a last request received from SDN controller 102 for version numbers. Programmable switch 104 can use information in its local cache directory 12, such as version numbers and/or timestamps to identify changes to data since a previous request from SDN controller 102. As with other messages sent by programmable switch 104, a timeout period for receiving a confirmation from SDN controller 102 can be used to ensure that the updates are received by SDN controller 102.
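As a sketch of block 1004, the switch can compare directory timestamps against a cutoff recorded at the previous request from SDN controller 102; keeping such a cutoff is an assumed bookkeeping detail.

```python
def collect_directory_updates(directory: dict, last_request_time: int) -> dict:
    """Return {address: version} for cache lines modified since the last controller request."""
    return {address: entry["version"]
            for address, entry in directory.items()
            if entry["timestamp"] > last_request_time}
```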
As discussed above, the foregoing use of a centralized programmable switch to maintain a local cache directory can allow for a distributed cache with coherent cache lines throughout the distributed cache. In addition, limiting the local cache directory of the programmable switch to the memory devices in communication with the programmable switch, such as to memory devices in the same server rack, can allow for scalability of the distributed cache. The use of a home memory device to act as a serialization point and a time synchronization point for the cache lines stored in the memory device further promotes scalability of the distributed cache. The foregoing arrangements of mirrored cache directories and packet retransmission also improve the fault tolerance of the distributed cache.
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks, modules, and processes described in connection with the examples disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Furthermore, the foregoing processes can be embodied on a computer readable medium which causes processor or controller circuitry to perform or execute certain functions.
To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, and modules have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Those of ordinary skill in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, units, modules, processor circuitry, and controller circuitry described in connection with the examples disclosed herein may be implemented or performed with a general purpose processor, a GPU, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. Processor or controller circuitry may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, an SoC, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The activities of a method or process described in connection with the examples disclosed herein may be embodied directly in hardware, in a software module executed by processor or controller circuitry, or in a combination of the two. The steps of the method or algorithm may also be performed in an alternate order from those provided in the examples. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable media, an optical media, or any other form of storage medium known in the art. An exemplary storage medium is coupled to processor or controller circuitry such that the processor or controller circuitry can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to processor or controller circuitry. The processor or controller circuitry and the storage medium may reside in an ASIC or an SoC.
The foregoing description of the disclosed example embodiments is provided to enable any person of ordinary skill in the art to make or use the embodiments in the present disclosure. Various modifications to these examples will be readily apparent to those of ordinary skill in the art, and the principles disclosed herein may be applied to other examples without departing from the spirit or scope of the present disclosure. The described embodiments are to be considered in all respects only as illustrative and not restrictive. In addition, the use of language in the form of “at least one of A and B” in the following claims should be understood to mean “only A, only B, or both A and B.”
This application claims the benefit of U.S. Provisional Application No. 62/842,959 entitled “DISTRIBUTED BRANCH PREDICTION WITH IN-NETWORK PREFETCH”, filed on May 3, 2019, which is hereby incorporated by reference in its entirety.
Other Publications:
Hashemi et al.; “Learning Memory Access Patterns”; 15 pages; Mar. 6, 2018; available at https://arxiv.org/pdf/1803.02329.pdf.
Kim et al.; “A Framework for Data Prefetching using Off-line Training of Markovian Predictors”; Sep. 18, 2002; 8 pages; available at https://www.comp.nus.edu.sg/~wongwf/papers/ICCD2002.pdf.
Pending U.S. Appl. No. 16/548,116, filed Aug. 22, 2019, entitled “Distributed Cache With In-Network Prefetch”, Marjan Radi et al.
Li et al.; “Pegasus: Load-Aware Selective Replication with an In-Network Coherence Directory”; Dec. 2018; 15 pages; Technical Report UW-CSE-18-12-01, University of Washington CSE, Seattle, WA.
Eisley et al.; “In-Network Cache Coherence”; 2006; pp. 321-332; Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture.
Jin et al.; “NetCache: Balancing Key-Value Stores with Fast In-Network Caching”; Oct. 28, 2017; pp. 121-136; Proceedings of the 26th Symposium on Operating Systems Principles.
Liu et al.; “IncBricks: Toward In-Network Computation with an In-Network Cache”; Apr. 2017; pp. 795-809; ACM SIGOPS Operating Systems Review 51, Jul. 26, No. 2.
Vestin et al.; “FastReact: In-Network Control and Caching for Industrial Control Networks using Programmable Data Planes”; Aug. 21, 2018; pp. 219-226; IEEE 23rd International Conference on Emerging Technologies and Factory Automation (ETFA), vol. 1.
Leslie Lamport; “Paxos Made Simple”; Nov. 1, 2001; available at: https://lamport.azurewebsites.net/pubs/paxos-simple.pdf.
Paul Krzyzanowski; “Understanding Paxos”; PK.org; Distributed Systems; Nov. 1, 2018; available at: https://www.cs.rutgers.edu/~pxk/417/notes/paxos.html.
Wikipedia; Paxos (computer science); accessed on Jun. 27, 2020; available at: https://en.wikipedia.org/wiki/Paxos_(computer_science).
Pending U.S. Appl. No. 16/916,730, filed Jun. 30, 2020, entitled “Devices and Methods for Failure Detection and Recovery for a Distributed Cache”, Radi et al.
Ivan Pepelnjak; Introduction to 802.1Qbb (Priority-based Flow Control-PFC); accessed on Jun. 25, 2020; available at: https://gestaltit.com/syndicated/ivan/introduction-802-1qbb-priority-based-flow-control-pfc/.
Juniper Networks Inc.; Configuring Priority-Based Flow Control for an EX Series Switch (CLI Procedure); Sep. 25, 2019; available at: https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/cos-priority-flow-control-cli-ex-series.html.
Pending U.S. Appl. No. 16/914,206, filed Jun. 26, 2020, entitled “Devices and Methods for Managing Network Traffic for a Distributed Cache”, Radi et al.
Cisco White Paper; “Intelligent Buffer Management on Cisco Nexus 9000 Series Switches”; Jun. 6, 2017; 22 pages; available at: https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-738488.html.
Pending U.S. Appl. No. 17/174,681, filed Feb. 12, 2021, entitled “Devices and Methods for Network Message Sequencing”, Marjan Radi et al.
Pending U.S. Appl. No. 17/175,449, filed Feb. 12, 2021, entitled “Management of Non-Volatile Memory Express Nodes”, Marjan Radi et al.
Botelho et al.; “On the Design of Practical Fault-Tolerant SDN Controllers”; Sep. 2014; 6 pages; available at: http://www.di.fc.ul.pt/~bessani/publications/ewsdn14-ftcontroller.pdf.
Huynh Tu Dang; “Consensus Protocols Exploiting Network Programmability”; Mar. 2019; 154 pages; available at: https://doc.rero.ch/record/324312/files/2019INFO003.pdf.
Jialin Li; “Co-Designing Distributed Systems with Programmable Network Hardware”; 2019; 205 pages; available at: https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/44770/Li_washington_0250E_20677.pdf?sequence=1&isAllowed=y.
Liu et al.; “Circuit Switching Under the Radar with REACToR”; Apr. 2-4, 2014; 16 pages; USENIX; available at: https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-liu_he.pdf.
Written Opinion dated Feb. 20, 2020 from International Application No. PCT/US2019/068360, 4 pages.
International Search Report and Written Opinion dated Apr. 27, 2020 from International Application No. PCT/US2019/068269, 6 pages.
Ibrar et al.; “PrePass-Flow: A Machine Learning based Technique to Minimize ACL Policy Violation Due to Links Failure in Hybrid SDN”; Nov. 20, 2020; Computer Networks; available at https://doi.org/10.1016/j.comnet.2020.107706.
Saif et al.; “IOscope: A Flexible I/O Tracer for Workloads' I/O Pattern Characterization”; Jan. 25, 2019; International Conference on High Performance Computing; available at https://doi.org/10.1007/978-3-030-02465-9_7.
Zhang et al.; “PreFix Switch Failure Prediction in Datacenter Networks”; Mar. 2018; Proceedings of the ACM on the Measurement and Analysis of Computing Systems; available at: https://doi.org/10.1145/3179405.
Pending U.S. Appl. No. 17/353,781, filed Jun. 21, 2021, entitled “In-Network Failure Indication and Recovery”, Marjan Radi et al.
International Search Report and Written Opinion dated Oct. 28, 2021 from International Application No. PCT/US2021/039070, 7 pages.
Liu et al.; “DistCache: provable load balancing for large-scale storage systems with distributed caching”; FAST '19 Proceedings of the 17th USENIX Conference on File and Storage Technologies; Feb. 2019; pp. 143-157 (Year 2019).
Radi et al.; “OmniXtend: direct to caches over commodity fabric”; 2019 IEEE Symposium on High-Performance Interconnects (HOTI); Santa Clara, CA; Aug. 2019; pp. 59-62 (Year 2019).
Wang et al.; “Concordia: Distributed Shared Memory with In-Network Cache Coherence”; 19th USENIX Conference on File and Storage Technologies; pp. 277-292; Feb. 2021.
U.S. Appl. No. 62/722,003, filed Aug. 23, 2018, entitled “Independent Datastore in a Network Routing Environment”, Sarkar et al.
U.S. Appl. No. 62/819,263, filed Mar. 15, 2019, entitled “Providing Scalable and Concurrent File System”, Kohli et al.