Advances in networking technology have led to the use of computer networks for a wide variety of applications, such as sending and receiving electronic mail, browsing Internet web pages, and exchanging business data. As the use of computer networks has proliferated, the technology upon which these networks are based has become increasingly complex.
Data is typically sent over a network in small packages called “packets,” which may be routed through a variety of intermediate network nodes before reaching their destination. These intermediate nodes (e.g., routers, switches, and the like) are often complex computer systems in their own right, and may include a variety of specialized hardware and software components.
For example, some network nodes may include one or more network processors for processing packets for use by higher-level applications. Network processors typically comprise a variety of components, including one or more processing units, memory units, buses, controllers, and the like.
A network processor will often be called upon to process packets corresponding to many different data streams. To do this, the network processor may process multiple streams in parallel, and may also be operable to switch between stream contexts by storing the current processing state for a given stream, processing another stream or performing some other task, and then restoring the processing context associated with the original data stream and resuming processing of that stream. The faster the network processor is able to perform its processing tasks, the faster the data streams it is handling will reach their destinations, and the faster any business processes that rely on those streams will be completed.
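As a rough illustration of this kind of context switching, a stream's processing state might be saved and restored along the following lines (the structure and function names below are hypothetical and are not taken from any particular network processor):

#include <string.h>

/* Hypothetical per-stream processing state; a real network processor keeps
 * analogous state in registers and/or local memory. */
struct stream_ctx {
    unsigned int stream_id;
    unsigned int next_seq;      /* next expected sequence number   */
    unsigned int bytes_seen;    /* running byte count for the flow */
    unsigned char proto_state;  /* protocol state-machine position */
};

/* Save the current context before switching to another stream. */
void save_ctx(struct stream_ctx *store, const struct stream_ctx *cur)
{
    memcpy(store, cur, sizeof(*store));
}

/* Restore a previously saved context so processing of that stream can resume. */
void restore_ctx(struct stream_ctx *cur, const struct stream_ctx *store)
{
    memcpy(cur, store, sizeof(*cur));
}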
Reference will be made to the following drawings, in which:
Systems and methods are disclosed for performing selective caching. It should be appreciated that these systems and methods can be implemented in numerous ways, several examples of which are described below. The following description is presented to enable any person skilled in the art to make and use the inventive body of work. The general principles defined herein may be applied to other embodiments and applications. Descriptions of specific embodiments and applications are thus provided only as examples, and various modifications will be readily apparent to those skilled in the art. For example, although several examples are provided in the context of Intel® Internet Exchange network processors, it will be appreciated that the same principles can be readily applied in other contexts as well. Accordingly, the following description is to be accorded the widest scope, encompassing numerous alternatives, modifications, and equivalents. For purposes of clarity, technical material that is known in the art has not been described in detail so as not to unnecessarily obscure the inventive body of work.
Network processors are used to perform packet processing and other networking operations. An example of a network processor 100 is shown in
Network processor 100 may also feature a variety of interfaces that carry packets between network processor 100 and other network components. For example, network processor 100 may include a switch fabric interface 102 (e.g., a Common Switch Interface (CSIX)) for transmitting packets to other processor(s) or circuitry connected to the fabric; a media interface 105 (e.g., a System Packet Interface Level 4 (SPI-4) interface) that enables network processor 100 to communicate with physical layer and/or link layer devices; an interface 108 (e.g., a Peripheral Component Interconnect (PCI) bus interface) for communicating with a host; and/or the like.
Network processor 100 may also include other components shared by the microengines 104 and/or core processor 110, such as one or more static random access memory (SRAM) controllers 112, dynamic random access memory (DRAM) controllers 106, a hash engine 101, and a relatively low-latency, on-chip scratch pad memory 103 for storing frequently used data. One or more internal buses 114 are used to facilitate communication between the various components of the system.
It will be appreciated that
As previously indicated, microengines 104 may, for example, comprise multi-threaded RISC engines having self-contained instruction and data memory to enable rapid access to locally stored code and data. Microengines 104 may also include one or more hardware-based coprocessors for performing specialized functions such as serialization, cyclic redundancy checking (CRC), cryptography, High-Level Data Link Control (HDLC) bit stuffing, and/or the like. The multi-threading capability of the microengines 104 may be supported by hardware that reserves different registers for different threads and can quickly swap thread contexts. The microengines 104 may communicate with neighboring microengines 104 via, e.g., shared memory and/or neighbor registers that are wired to adjacent engine(s).
In a system such as that described above, each microengine may be responsible for processing a large number of different connections at the same time. As a microengine switches back and forth between connections, it will often need to retrieve previously stored information about those connections. All of this data may be stored in static or dynamic random access memory (SRAM or DRAM); however, retrieving data from SRAM or DRAM is generally relatively time-consuming.
Thus, in one embodiment, the microengines make use of caching techniques to maintain a local store of the most frequently used data, thereby increasing each microengine's processing efficiency by decreasing the average amount of time needed to retrieve previously stored data.
Caching is used in many hardware and software contexts, and generally refers to the use of a relatively small amount of relatively fast memory to store data that is frequently used. By storing frequently used data in a low-latency cache, the number of times a processor must access data from relatively slow (high-latency) sources such as external SRAM or DRAM is reduced.
Caches typically have only a limited storage capacity, since they are often integrated with the processor itself. Thus, while ideally all data needed by the processor would be stored in the cache, in reality this is impractical, since providing a cache of that size would be prohibitively expensive and/or infeasible given a typical processor's size constraints. Techniques are therefore needed for using the cache's limited amount of memory most effectively.
Several such techniques are presented below. In one embodiment, the microengines of a network processor such as that shown in
As previously indicated, one way to improve this probability is to increase M, the number of data elements the cache can hold, by, e.g., increasing the size of the cache. However, this will often be impractical, and, in any event, there will ultimately be some limit on how large M can be.
Thus, in one embodiment a selective caching technique is used to increase the probability of a desired piece of data being found in the cache (i.e., a “cache hit”) without changing M, which is assumed to be fixed. In accordance with this technique, data are only cached if certain criteria are satisfied, rather than blindly caching all data. The criteria can be based on patterns observed in the incoming data, and/or on other characteristics of the data and/or its context.
As previously indicated, a microengine will often receive data from multiple data streams or “pipes.” The higher the capacity of a data pipe, the greater the probability of receiving data on that pipe. Thus, in one embodiment a selective caching technique is used that only caches data associated with pipes having at least a minimum capacity (or bandwidth), C. Data received from pipes with a capacity less than C are not cached, but are instead dynamically loaded from, or sent to, relatively slow memory such as SRAM, taking care to maintain the atomicity of these operations if multiple contexts are acting on the data pipes in parallel. An advantage of this approach is that the cache is only used to store data that is most in need of caching.
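By way of illustration only, the capacity test described above might be expressed in C as follows, where the constant, structure, and function names are hypothetical and the threshold value simply stands in for C:

#include <stdbool.h>
#include <stdint.h>

#define MIN_CACHEABLE_CAPACITY_BPS 1000000u   /* example value of C (bits per second) */

struct pipe_info {
    uint32_t id;
    uint32_t capacity_bps;   /* provisioned capacity (bandwidth) of the pipe */
};

/* Selective caching criterion: only data belonging to pipes with at least
 * capacity C is eligible for the cache; all other data is read from, or
 * written to, slower memory (e.g., SRAM) directly. */
bool is_cacheable(const struct pipe_info *pipe)
{
    return pipe->capacity_bps >= MIN_CACHEABLE_CAPACITY_BPS;
}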
The probability that a given piece of data will be found in the cache can be computed in the following manner. Assume that there are K data elements associated with data streams having a capacity greater than C (i.e., there are K cacheable data elements). If there is a total of N data elements, then N − K data elements are not cacheable. If it is assumed that, on average, the capacity or data rate of the pipes associated with the K cacheable data elements is R times that of the N − K non-cacheable data elements, then the probability of a cache hit will be equal to the probability that a given piece of data is cacheable, multiplied by the probability that a given cacheable piece of data is in the cache. Namely, the probability of a cache hit, P(hit), is given by the following equation:

P(hit) = [(R * K)/((R * K) + (N − K))] * (M/K) = (R * M)/((R − 1) * K + N)
This can be rearranged to yield:

P(hit) = (M/N) * [(R * N)/((R − 1) * K + N)]
The factor (R * N)/((R − 1) * K + N) will be referred to as the cache hit multiplier, and can be tuned to be greater than 1 by carefully selecting K, where R is, by design, assumed to be greater than one. For example, if R = 2 and K = N/2, the value of the cache hit multiplier will be 4/3, representing a gain of 33%. That is, the cache hit multiplier represents the amount by which the probability of a cache hit is increased over the probability (i.e., M/N) that would be obtained if selective caching were not used.
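As a numerical check of the example above, the following short program (using illustrative values only) evaluates the baseline hit probability M/N, the cache hit multiplier, and the resulting P(hit) for R = 2 and K = N/2:

#include <stdio.h>

int main(void)
{
    double N = 1000.0;   /* total number of data elements            */
    double K = N / 2.0;  /* number of cacheable data elements        */
    double M = 100.0;    /* number of entries the cache can hold     */
    double R = 2.0;      /* data-rate ratio, cacheable:non-cacheable */

    double baseline   = M / N;                          /* no selective caching */
    double multiplier = (R * N) / ((R - 1.0) * K + N);  /* cache hit multiplier */
    double p_hit      = baseline * multiplier;

    printf("baseline P(hit)  = %.4f\n", baseline);      /* prints 0.1000 */
    printf("hit multiplier   = %.4f\n", multiplier);    /* prints 1.3333 */
    printf("selective P(hit) = %.4f\n", p_hit);         /* prints 0.1333 */
    return 0;
}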
Thus, by optimizing the value of the cache hit multiplier, the average memory access time is decreased from that which would be achievable using a conventional caching algorithm, in which all data is cached without regard to the capacity of the data pipe with which it is associated. Moreover, by selectively caching based on data pipe capacity, it is possible to achieve better hit rates for the same number of cache entries. Put differently, the use of selective caching reduces the cache memory requirements for achieving a given hit rate.
A processor 204 (such as a microengine 104 in
In the example shown in
If, on the other hand, the requested data element corresponds to a low-capacity data pipe, it is simply retrieved from memory 202 and provided to the processor for further processing, with no changes being made to the cache.
In the example shown in
It will be appreciated that
If, on the other hand, the data retrieved from the DRAM 202 meets the caching criteria (i.e., a “Yes” exit from block 316), then the cache entry is synchronized with the corresponding DRAM entry (block 318), the data is stored in the cache, and the CAM is updated accordingly (block 320). It will be appreciated that in some embodiments some or all of the actions shown in blocks 318 and 320 can be performed in the background (i.e., they need not be performed at runtime before the data read from the DRAM is used).
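One possible software rendering of this read path is sketched below. It is offered only as an illustration: the entry count, the 64-byte element size, the trivial eviction choice, the DRAM stand-ins, and all identifiers are assumptions rather than features of any particular implementation, and the synchronization shown inline could instead be deferred to the background as noted above.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_ENTRIES 16                      /* M: small, CAM-indexed local store */
#define ELEMENT_SIZE  64                      /* assumed size of one data element  */
#define MIN_CACHEABLE_CAPACITY_BPS 1000000u   /* example value of C                */

struct cache_entry {
    bool     valid;
    bool     dirty;
    uint32_t tag;                             /* address/key tracked by the CAM    */
    uint8_t  data[ELEMENT_SIZE];              /* local copy of the DRAM data       */
};

static struct cache_entry cache[CACHE_ENTRIES];
static uint8_t dram[1 << 16];                 /* stand-in for external DRAM        */

static void dram_read(uint32_t addr, uint8_t *buf)
{
    memcpy(buf, &dram[addr % (sizeof(dram) - ELEMENT_SIZE)], ELEMENT_SIZE);
}

static void dram_write(uint32_t addr, const uint8_t *buf)
{
    memcpy(&dram[addr % (sizeof(dram) - ELEMENT_SIZE)], buf, ELEMENT_SIZE);
}

/* CAM lookup: return the index of the matching entry, or -1 on a miss. */
static int cam_lookup(uint32_t tag)
{
    for (int i = 0; i < CACHE_ENTRIES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return i;
    return -1;
}

/* Read one data element into 'out' (ELEMENT_SIZE bytes), applying the
 * selective caching criterion on a miss. */
void read_element(uint32_t addr, uint32_t pipe_capacity_bps, uint8_t *out)
{
    int i = cam_lookup(addr);
    if (i >= 0) {                             /* cache hit                          */
        memcpy(out, cache[i].data, ELEMENT_SIZE);
        return;
    }

    dram_read(addr, out);                     /* cache miss: fetch from DRAM        */

    if (pipe_capacity_bps < MIN_CACHEABLE_CAPACITY_BPS)
        return;                               /* does not meet the caching criteria */

    /* Meets the criteria: reclaim an entry, synchronize it with DRAM if it is
     * dirty, then store the new data and update the CAM accordingly.           */
    int victim = 0;                           /* trivial eviction choice (assumption) */
    if (cache[victim].valid && cache[victim].dirty)
        dram_write(cache[victim].tag, cache[victim].data);

    cache[victim].valid = true;
    cache[victim].dirty = false;
    cache[victim].tag   = addr;
    memcpy(cache[victim].data, out, ELEMENT_SIZE);
}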
It will be appreciated that
The systems and methods described above can be used in a variety of computer systems. For example, without limitation, the circuitry and techniques shown in
Individual line cards 400 may include one or more physical layer devices 402 (e.g., optical, wire, and/or wireless) that handle communication over network connections. The physical layer devices 402 translate the physical signals carried by different network media into the bits (e.g., 1s and 0s) used by digital systems. The line cards 400 may also include framer devices 404 (e.g., Ethernet, Synchronous Optic Network (SONET), and/or High-Level Data Link Control (HDLC) framers, and/or other “layer 2” devices) that can perform operations such as error detection and/or correction on frames of data. The line cards 400 may also include one or more network processors 406 (such as network processor 100 in
While
Thus, while several embodiments are described and illustrated herein, it will be appreciated that they are merely illustrative. Other embodiments are within the scope of the following claims.