Rings (or “circular buffers”) are used to pass messages between agents such as central processing units (“CPUs”), I/O devices and co-processors. The messages may include data, pointers, and any other type of information to be exchanged between such agents. Rings are also used to pass messages between threads or processes running on a single agent. A ring is typically implemented by an array in memory, and a pair of pointers or offsets into that array which increment linearly through the entries in the array and “wrap”, modulo the ring size, when the end of the array is reached. One of these pointers is used to add a new entry onto the tail of the ring. This pointer is referred to as a “produce pointer”. The agent that performs the produce access (or enqueuing) operation using the produce pointer is referred to as the “producer”. The other pointer, referred to as a “consume pointer”, is used to remove entries from the head of the ring. The agent that performs the consume access (or dequeuing) operation is referred to as the “consumer”.
A particular pointer may be owned by a single agent. That agent would maintain a copy of the pointer that is used to determine which entry to read or write in the array. Alternatively, if a pointer is used by multiple producers and/or multiple consumers, each of those agents would require a mutual exclusion mechanism enabling it to act atomically on the ring pointers and associated entries.
The produce pointer is both read and written by the producer, which uses it and increments its value. Similarly, the consume pointer is both read and written by the consumer. A variety of mechanisms can be used to enable the producer to determine if there is space available in the ring and to enable the consumer to determine if there are any entries available in the ring. A common mechanism is to compare the consume and produce pointers, with an extra high-order bit in the pointers used to disambiguate a full ring from an empty one when the pointer values are otherwise equal. In this case, the produce pointer is also read by the consumer, and the consume pointer is read by the producer.
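By way of illustration only, the following sketch shows one possible realization of such a ring in the C language, using a power-of-two array and produce/consume pointers that carry one extra high-order bit so that a full ring can be distinguished from an empty one. The names and sizes are hypothetical, and the synchronization and memory-ordering details needed between separate agents are omitted for brevity.

```c
/* Minimal ring sketch (hypothetical names).  The produce and consume
 * pointers carry one extra high-order bit beyond what is needed to index
 * the array, so a full ring (pointers differ only in that bit) can be
 * told apart from an empty one (pointers identical). */
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 256u                    /* must be a power of two */

struct ring {
    uint32_t entries[RING_SIZE];
    uint32_t produce;                     /* owned/written by the producer */
    uint32_t consume;                     /* owned/written by the consumer */
};

static bool ring_empty(const struct ring *r)
{
    return r->produce == r->consume;
}

static bool ring_full(const struct ring *r)
{
    /* indices equal but the extra wrap bit differs */
    return (r->produce ^ r->consume) == RING_SIZE;
}

static bool ring_enqueue(struct ring *r, uint32_t msg)    /* producer side */
{
    if (ring_full(r))
        return false;
    r->entries[r->produce & (RING_SIZE - 1)] = msg;
    r->produce = (r->produce + 1) & (2 * RING_SIZE - 1);  /* wrap modulo 2N */
    return true;
}

static bool ring_dequeue(struct ring *r, uint32_t *msg)   /* consumer side */
{
    if (ring_empty(r))
        return false;
    *msg = r->entries[r->consume & (RING_SIZE - 1)];
    r->consume = (r->consume + 1) & (2 * RING_SIZE - 1);
    return true;
}
```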
An agent sometimes maintains a private local copy of the pointer it owns in order to minimize overhead related to ring accesses. Also, for various reasons, agents may use “batch” notifications of enqueuing to or dequeuing from a ring so as to amortize the overheads for the agents. Thus, a producer might maintain a private copy of the produce pointer for use in writing entries into the ring, as well as a separate public copy of that pointer, updated less frequently, to pass “chunks” of entries to the consumer at one time. A consumer might similarly use a private and public copy of the consume pointer to indicate space being freed up, enabling “lazy” retirement and resource recovery to optimize those functions and to decouple them from servicing the ring. Thus, a private copy and public copy of a pointer serve the different functions of access and notification, respectively.
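The following fragment, again purely illustrative and with hypothetical names, summarizes the private/public split just described: the private copies are used on every access, while the public copies are updated less frequently and serve only as notification to the other side.

```c
#include <stdint.h>

/* Sketch of the private/public pointer split (hypothetical names). */
struct producer_state {
    uint32_t private_tail;   /* advanced on every enqueue (access) */
    uint32_t public_tail;    /* pushed out in batches (notification) */
};

struct consumer_state {
    uint32_t private_head;   /* advanced on every dequeue (access) */
    uint32_t public_head;    /* freed space advertised lazily (notification) */
};
```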
Having the producer read the consume pointer and the consumer read the produce pointer to determine ring status contributes to communications overhead. To minimize this overhead, ring credits are sometimes used. A ring credit indicates that there is a free ring entry for the producer to use. The consumer passes credits to the producer, which the producer adds to its local credit pool. The mechanism for passing credits involves delivery into a credit pool by the consumer and fetching from a credit pool by the producer. Credit passing can also be batched, which reduces the traffic used for producer credit notification but does not address the cost of notifying the consumer that the ring is non-empty.
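As a concept sketch only, with hypothetical names and C11 atomics standing in for whatever mechanism the agents actually share, the credit exchange can be pictured as follows: the consumer grants credits as it frees entries, and the producer spends one credit per entry written, so neither side needs to read the other's pointer to check ring occupancy. The batched variant is described, and sketched, later in this document.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint credits;               /* free entries granted to producer */

static bool producer_reserve(unsigned n)  /* called before writing n entries */
{
    unsigned avail = atomic_load(&credits);
    while (avail >= n) {
        if (atomic_compare_exchange_weak(&credits, &avail, avail - n))
            return true;                  /* credits spent; safe to enqueue */
    }
    return false;                         /* not enough space on the ring */
}

static void consumer_release(unsigned n)  /* called after freeing n entries */
{
    atomic_fetch_add(&credits, n);
}
```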
Also coupled to the FSB 16 is a memory controller 20, which connects to a memory 22. The memory 22 is shared by and common to the various agents of the system 10. The memory controller 20 manages accesses to the shared memory 22 by such agents. The memory controller 20 may serve as a hub or bridge, and therefore includes circuitry that connects to and communicates with other system logic and I/O components, shown collectively as system logic and I/O block 34. Components in the system logic and I/O block 34 may connect to a backplane, external devices, and/or communication links.
The processor 12 and the GPP 14 each include a cache 24, 38. The size and organization of each cache are matters of design choice. For example, the cache 24 may be organized as an 8-way set associative cache, with each way including 2048 cache lines of 64 bytes per cache line. The cache may also comprise a hierarchy of caches of different sizes and organization.
Maintained in the shared memory 22 are data structures implementing rings (e.g., ring arrays) 26 and associated ring descriptors 28. One side of each ring is managed by hardware, shown as ring manager 30, in the processor 12. The ring manager 30 is connected to and accessed by the processor's agents (e.g., MEs and control processor), shown as ring users 31, via internal bus 36. The ring manager 30 contains FIFOs (not shown) for buffering commands and data being transferred between the agents of the processor 12 and the shared memory 22. The set of command FIFOs include an array of enqueue FIFOs and an array of dequeue FIFOs presented to the agents as an array of registers (one enqueue register and one dequeue register per ring) used to place data on a ring or to remove data from a ring, respectively. The ring users 31 issue writes to entries in the array of enqueue registers or reads to entries in the array of dequeue registers, with other operations handled in the hardware of the ring manager itself. The ring manager 30 also includes Control and Status Registers (CSRs) to store ring configuration parameters (such as ring size) and base values for the ring descriptors, as will be discussed with reference to
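Purely as an assumed illustration of the programming model presented to the ring users 31 (the register addresses, widths and layout below are invented for the sketch and are not the actual hardware interface), a ring user might enqueue by writing a ring's enqueue register and dequeue by reading its dequeue register:

```c
#include <stdint.h>

#define NUM_RINGS 16
#define RING_MGR_ENQ_BASE  ((volatile uint32_t *)0x80000000u)  /* assumed address */
#define RING_MGR_DEQ_BASE  ((volatile uint32_t *)0x80000100u)  /* assumed address */

static inline void ring_put(unsigned ring, uint32_t data)
{
    RING_MGR_ENQ_BASE[ring] = data;     /* write = enqueue onto that ring */
}

static inline uint32_t ring_get(unsigned ring)
{
    return RING_MGR_DEQ_BASE[ring];     /* read = dequeue from that ring */
}
```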
The ring manager 30 performs consume access (dequeue) and produce access (enqueue) operations when it receives commands on the internal bus 36. These ring access commands are received in the command FIFO registers, as discussed above. They may be in the form of “put” or “get” commands (e.g., for an ME ring user), or Load or Store instructions (e.g., for an Xscale core ring user), to give a few examples. The GPP 14 uses software routines of the ring manager 32 to perform consume and produce operations.
The ring manager 30 performs other operations besides ring management. For example, the ring manager 30 manages the FSB protocol and interrupts over the FSB 16, directs data to and from the cache 24, maintains a cache coherency protocol and manages cache activities (such as replacement, tag lookups and so forth) for the cache 24 in the processor 12.
The ring manager 30 provides the capability to move data to and from the shared memory 22. The types of access include: read accesses, write accesses and atomic read-modify-write accesses. Read and write accesses can specify whether or not to allocate space in the cache in the event of a cache miss. Atomic read-modify-write commands are used to maintain coherency over semaphores in the shared memory 22.
The processor 12 and GPP 14 allow selected areas of shared memory to be cached and a type of caching (called “memory type”) to be specified for selected areas. Supported memory types include: Uncacheable (UC); Write-Through (WT); Write Back (WB); Write Protected (WP); and Write Combining (WC).
If the UC memory type is specified, the selected area is not cached. For WT, writes and reads to and from the selected area are cached. Reads come from cache lines on cache hits, and read misses cause cache fills. All writes are written to a cache line and through to the shared memory. This mechanism enforces coherency between the various caches and the shared memory. With the WB memory type, writes and reads to and from shared memory are cached. Reads come from cache lines on cache hits; read misses cause cache fills. Write misses cause cache line fills, and writes are performed entirely in the cache, when possible. A write-back operation is triggered when cache lines need to be deallocated. In WP mode, reads come from cache lines when possible, and read misses cause cache fills. Writes are propagated to the system bus and cause corresponding cache lines on all processors on the bus to be invalidated. When WC is used, shared memory locations are not cached, and writes may be delayed and combined in a write buffer to reduce memory accesses. The processor 12 and GPP 14 use their respective caches to hold local copies of recently accessed data, including ring data and descriptors, from the shared memory, to reduce FSB 16 bandwidth used by the processors. Coherence (or consistency) may be maintained among the multiple caches 24, 38 and shared memory 22 by a variety of well-known mechanisms, such as snooping caches and directories, and well-known protocols such as Modified Exclusive Shared Invalid (MESI) or Modified Owner Exclusive Shared Invalid (MOESI).
In one exemplary embodiment, as described herein, the snooping cache coherency protocol that is used is the MESI protocol, where “MESI” refers to the four cache states “modified” (M), “exclusive” (E), “shared” (S), and “invalid” (I). Each cache line in the cache can be in one of the four states. The ring data structures 26, 28 are used by the agents in such a way as to minimize the memory communication and resulting coherence traffic between the producing and consuming agent(s) on each ring, as will be described.
The GPP 14 uses both page cachability attributes and Memory Type Range Registers (MTRRs) to determine cache attributes of FSB accesses. Similarly, the processor 12 uses both FSB instruction type and MTRR to define FSB transaction cachability. The MTRRs allow the type of caching to be specified in the shared memory for selected physical address ranges. Table 1 below defines a cache allocation policy, according to the described embodiment. Other cache allocation policies may be used as well.
The ring manager 30 monitors (“snoops”) FSB 16 accesses from other processors, such as the GPP 14, and responds as required to keep the cache 24 and the other processors' caches coherent. This snooping activity is handled by hardware in the ring manager 30, and is transparent to software. The snoop response can indicate a hit for addresses that are not actually in the cache but for which the cache has responsibility, since the ring manager 30 maintains coherency for data from the point in time that it has initiated an FSB read for data it intends to modify, until the data is written out on the FSB 16. The modified data could be in flight from memory, in internal ring manager buffers, in the cache 24, in the process of being evicted from the cache 24, and so forth. The ring manager 30 will stall snoop responses when the address hits a locked cache line. Cache lines are locked during updates for ring operations, as will be described later.
As shown in
Each ring can be independently configured for size and can be independently located in memory (i.e., the different rings need not reside in a contiguous region of memory). Several techniques are applicable to ring size configuration. For example, a ring or group of rings could be configured by a control register indicating the ring size. Alternately, the ring size may be stored as data in a ring's descriptor. The size and the alignment of each memory array representing a ring may be restricted to a power of 2 to allow the full pointer to be stored in one location in the ring descriptor. By using the ring size to determine which high-order bits to hold constant and which to include in the incrementing pointer, a ring base and an incrementing index for each ring can be stored efficiently in the ring's descriptor, as sketched below. Alternatively, one could support arbitrary alignment and/or arbitrary size of independently located rings by storing the ring upper_bound address and the ring size (or, equivalently, the ring_base and the ring size); when the boundary is reached, the pointer is reset to the bound minus the size (or, equivalently, is set to the base value). Threshold values (e.g., for pushing out a new public pointer, for drawing credits from the public pool and for spilling credits to the public pool) are all configurable uniquely per ring.
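The following sketch, with hypothetical names, illustrates the power-of-two addressing scheme: because the ring array is a power of two in size and aligned to its size, a single stored pointer supplies both the constant high-order “base” bits and the incrementing low-order index.

```c
#include <stdint.h>

struct ring_desc {
    uint64_t ptr;          /* full byte address of the current entry */
    uint32_t size_bytes;   /* power of two, e.g. 4096 */
};

/* Return the current entry address and advance the pointer by one entry,
 * keeping the high-order (base) bits constant and wrapping only the
 * low-order index bits. */
static uint64_t ring_advance(struct ring_desc *d, uint32_t entry_bytes)
{
    uint64_t addr = d->ptr;
    uint64_t mask = d->size_bytes - 1;            /* low bits that increment */

    d->ptr = (addr & ~mask) | ((addr + entry_bytes) & mask);
    return addr;
}
```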
The rings facilitate message passing between agents, in both directions. The hardware ring manager 30 may be defined with modes for communication in both directions, selected on a per-ring basis. That is, each ring is configurable to select whether the GPP(s) 14 or the hardware of the processor 12 acts as the consumer, and the other as the producer. The direction could be hardwired as well. A given ring is configured for use in one direction only.
Referring now to
Referring to
A higher threshold causes less frequent updates, and therefore uses fewer cycles on the FSB 16. A higher threshold also causes more delay in notifying consumers of new information on the ring. The threshold field might contain the threshold value or a code indicating a threshold value. In one implementation, 3-bit values can be used to define eight corresponding thresholds (each specifying a given number of entries).
Thus, the value in the P_Count field 82 provides a count of the number of items enqueued since the last time the public produce pointer was updated with the value of the private (actual) produce pointer, and the threshold value stored in the Threshold field 84 is compared to the count to determine when to update the public pointer and thus pass those new items to the consumer. The Public_Tail field 90 is read by the consumer and used to update the C_Count field 74 in the head descriptor 50 whenever the value of the C_Count field 74 is less than the amount of data needed to satisfy a consume access operation. The private and public copies of the produce pointer therefore serve to moderate notification to the consumer and to minimize cache/memory traffic. Note that the consumer could periodically poll the ring, or be notified by a sideband communication (such as an interrupt) from the producer that there is data in the ring to be serviced, possibly with a threshold to batch such notification and possibly with an associated timer in order to bound notification delays. To support polling, the ring data structures could be configured to return a NULL value if not enough data is present in the ring to satisfy the size requested.
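A producer-side sketch of this mechanism (hypothetical names; pointer wrap-around is omitted) might look as follows: the private tail advances on every enqueue, and the public tail is pushed out only once a threshold's worth of new entries has accumulated.

```c
#include <stdint.h>

struct tail_desc {
    uint32_t tail;         /* private (actual) produce pointer */
    uint32_t p_count;      /* entries enqueued since last public update */
    uint32_t threshold;    /* batch size before notifying the consumer */
};

struct public_desc {
    uint32_t public_tail;  /* copy of the produce pointer seen by the consumer */
};

static void produce_notify(struct tail_desc *t, struct public_desc *p,
                           uint32_t entries_written)
{
    t->tail    += entries_written;     /* wrap handling omitted for brevity */
    t->p_count += entries_written;
    if (t->p_count >= t->threshold) {
        p->public_tail = t->tail;      /* pass the whole batch to the consumer */
        t->p_count = 0;
    }
}
```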
Ring credit exchange between a consumer and producer might represent the actions of a single ring user at each end of the ring, or might represent the collective actions of a multiplicity of ring users sharing that ring at either end. Each user on a shared ring can use the mutual exclusion lock to atomically draw credits from the public credit count into a local, private variable or to return credits to the public pool. The public credit count value is incremented by the consumer and decremented by the producer as credits are moved from a consuming user's private local “pool” (a private credits count maintained in a local credits variable by a user at the consuming end of the ring) to the public credit pool (that is, the credit descriptor's Credits field value, e.g., Credits 100 in credit descriptor 56) and then to a producing user's private local pool (a private credits count maintained in a local credits variable by a user at the producing end of the ring). In the illustrated embodiment, the ring users themselves maintain the local credit variables and perform the credit management. Alternatively, or in addition, the private credits count(s) could be stored with other local variables in the ring descriptors (the head descriptor for a consumer's private credit count and the tail descriptor for a producer's private credit count).
The producer subtracts the number of ring locations written during produce access operations from the available local credits. The producer replenishes its local credits from the Credits field 100 whenever it determines that the number of local credits has dropped below a low “watermark” (or threshold level). For example, the producer could take 5000 credits from the public credit pool whenever the number of local credits goes below 200, checking first to make sure that the public credit pool would not go below zero as a result. Allocating the credits in a group and caching them locally in this manner minimizes the amount of shared memory traffic used to allocate the ring credits. The consumer accumulates credits into its local credits variable each time it finishes processing data it has read from the ring, adding the number of removed ring items to the local credits. When the consumer determines that the number of local credits has gone above a high watermark (or threshold level), the consumer adds the local credits to the Credits field 100. Accumulating credits locally and then returning the credits in a group minimizes the amount of shared memory traffic used to return the ring credits to the public credit pool. There will actually be more space available on the ring than the credit counts indicate, because of the batching of credit allocation and freeing described above. A large number of (or all available) credits may be allocated or freed at one time to minimize accesses, and the likelihood of collisions, to the shared Credits field, which is incremented and decremented under protection of a mutual exclusion lock (mutex) shared between the hardware ring manager 30 and the software ring manager 32 in the shared memory.
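The following sketch illustrates the batched credit exchange with hypothetical names; the 200 and 5000 values follow the example above, while the consumer's high watermark is an assumed value. The public pool stands in for the Credits field in shared memory, and a pthread mutex stands in for the mutex shared between the ring managers.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t credit_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned public_credits;            /* models the Credits field in shared memory */

#define LOW_WATERMARK   200u               /* producer refills below this */
#define REFILL_AMOUNT   5000u
#define HIGH_WATERMARK  1000u              /* consumer spills above this (assumed value) */

/* Producer side: spend local credits per entry written, refill in bulk. */
static bool producer_spend(unsigned *local, unsigned entries)
{
    if (*local < LOW_WATERMARK) {
        pthread_mutex_lock(&credit_mutex);
        if (public_credits >= REFILL_AMOUNT) {   /* never drive the pool below zero */
            public_credits -= REFILL_AMOUNT;
            *local += REFILL_AMOUNT;
        }
        pthread_mutex_unlock(&credit_mutex);
    }
    if (*local < entries)
        return false;                            /* not enough known ring space */
    *local -= entries;
    return true;
}

/* Consumer side: accumulate locally, return to the pool in bulk. */
static void consumer_collect(unsigned *local, unsigned entries_freed)
{
    *local += entries_freed;
    if (*local > HIGH_WATERMARK) {
        pthread_mutex_lock(&credit_mutex);
        public_credits += *local;
        pthread_mutex_unlock(&credit_mutex);
        *local = 0;
    }
}
```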
In one exemplary embodiment, the credit management is executed in software on the GPP 14 and the ring users of the processor 12. Alternatively, the credit management could be performed by hardware. Note also that this mechanism does not preclude multiple producers from writing to a particular ring or multiple consumers from reading from a particular ring. In that situation, it is desirable to set the watermark thresholds such that all producers are able to prefetch sufficient credits into their local credit pools without starving the other producers, and to size the rings appropriately to account for the less-efficient ring utilization due to multiple such “credit caches”. In addition, when multiple producers are sharing a ring, a producer should not automatically fetch all available credits, but should instead apply a policy that ensures greedy credit-fetching does not starve the peer producers. The ring size and the consumer thresholds for returning credits must similarly account for the fact that more than one consumer is collecting credits for batch return to the pool for that ring. The watermark threshold values for drawing credits from the public pool and for returning credits to the public pool are all configurable uniquely per ring and per ring user.
Consume access (dequeue) and produce access (enqueue) operations both access parts of the ring descriptor. Referring to
Referring to
Still referring to
The shared memory write 150 is much the same as the shared memory write used for non-ring-related shared memory accesses, except that the address used in the ring case is derived indirectly from the ring number and command. Details of the memory write 150, according to one exemplary implementation, are as shown in
All activities but the credits update (at 132) are performed by the ring manager 30 in the processor 12. Similar activities are performed by the software ring manager 32 running on the GPP 14.
Produce access operations update the value of Public_Tail based on the threshold value, as described above (at 140 and 142). To handle the case where accumulated operations do not reach the threshold for a long time (for example, due to a pause in the production of messages), a timeout mechanism may be provided to ensure that consumers are notified of the arrival of those messages. Otherwise, the P_Count value for a particular ring and ring descriptor may stay the same for an unbounded amount of time. The timeout may be handled by the ring manager hardware for rings used for communications from the processor ring users 31 to the GPP 14, and by the ring manager software in the GPP 14 for rings on which the GPP 14 is the producer and the ring users 31 are the consumers, that is, for communications going in the opposite direction.
One example of a timeout mechanism would be a hardware process in the ring manager 30 and a software process in the routines of the ring manager 32 that scans all of the tail/produce descriptors and triggers an update for a particular ring if that ring was non-empty when visited. Each ring would be visited once per timeout period.
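One such scan, reduced to a hypothetical software sketch, might simply visit each tail descriptor once per timeout period and push out the public pointer for any ring that has accumulated entries not yet announced:

```c
#define NUM_RINGS 16

struct tail_desc   { unsigned tail, p_count, threshold; };
struct public_desc { unsigned public_tail; };

/* Called once per timeout period. */
static void timeout_scan(struct tail_desc tails[NUM_RINGS],
                         struct public_desc pubs[NUM_RINGS])
{
    for (int i = 0; i < NUM_RINGS; i++) {
        if (tails[i].p_count != 0) {          /* entries waiting below threshold */
            pubs[i].public_tail = tails[i].tail;
            tails[i].p_count = 0;
        }
    }
}
```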
In another example implementation, the ring manager 30 may include timeout support in the form of timeout state machines to manage timeouts for some number of rings. In such an example, and as shown in
It may be desirable to reduce the frequency of updates to the public produce pointer when the produce pointer is maintained in a shared coherent memory and the agents have different cached copies, as described above. Each write access to the produce pointer puts that copy in the ‘M’ state in the writer's cache, and subsequent writes can be immediately serviced. Each read of that pointer by the other agent moves it to the ‘S’ state in both caches, requiring the writer to go out on the coherent bus to regain exclusive ownership of the pointer variable for the next enqueue-increment. It is advantageous to minimize the number of such coherent transactions as messages are passed with high throughput between agents. Batching of notification can also help to optimize software in high-throughput situations by amortizing the software overheads of entering and leaving the service routine, only invoking it when there is a whole batch of entries to service (or expecting a high success rate when polling the ring periodically). The timeout mechanism to force the update of the public pointer helps to ensure that a partial batch does not wait too long for service.
The ring manager compares 206 the value in the C_Count field 74 in the head descriptor to the message size to determine if enough data is available on the ring to complete the requested operation. If the C_Count 74 test indicates that there is not enough data on the ring (that is, the value in the C_Count field 74 is less than the message size), then the ring manager reads 208 the public descriptor 54 into cache, if the public descriptor 54 is not cached already. The public descriptor read can also be performed according to the ring descriptor read 110 (
After the cache line lock status bit is set and any necessary head descriptor updates of the C_Count 74 and Prev_Tail 72 fields based on the Public_Tail 90 field are performed, the ring manager updates 216 the Head_Ptr 70 and C_Count 74 fields according to the message size (amount of data being removed from the ring). Once it has made those updates, the ring manager clears 218 the cache line lock status bit.
The ring manager checks 220 the cache to determine if it holds all of the data addressed by the pre-modified head pointer. If there is a cache miss, the ring manager reads 222 that data from the shared memory into the cache, based on the cache allocation policy using the head pointer as the address. The ring manager then provides 224 the data to the requesting agent. If the data already resides in the cache, then it is provided to the requesting agent (at 224) without the need for the FSB shared memory access.
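A consumer-side sketch of this dequeue flow (hypothetical names; the cache line lock status bit is reduced to an ordinary mutex, and pointer wrap-around is omitted) might look as follows; a failure return models the NULL result mentioned earlier for the case where the ring does not yet hold enough data.

```c
#include <stdbool.h>
#include <stdint.h>
#include <pthread.h>

struct head_desc {
    uint32_t head_ptr;     /* private (actual) consume pointer */
    uint32_t c_count;      /* entries known to be available */
    uint32_t prev_tail;    /* last public tail value observed */
    pthread_mutex_t lock;  /* stands in for the cache-line lock bit */
};

struct public_desc { uint32_t public_tail; };

static bool ring_consume(struct head_desc *h, const struct public_desc *p,
                         uint32_t msg_size, uint32_t *out_head)
{
    pthread_mutex_lock(&h->lock);
    if (h->c_count < msg_size) {
        /* refresh the count from the public produce pointer */
        uint32_t pub = p->public_tail;
        h->c_count  += pub - h->prev_tail;     /* wrap handling omitted */
        h->prev_tail = pub;
    }
    if (h->c_count < msg_size) {
        pthread_mutex_unlock(&h->lock);
        return false;                          /* caller would see NULL */
    }
    *out_head    = h->head_ptr;                /* address the data is read from */
    h->head_ptr += msg_size;
    h->c_count  -= msg_size;
    pthread_mutex_unlock(&h->lock);
    return true;
}
```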
The shared memory read is much the same as that used for non-ring-related shared memory accesses, except that the address used in the ring case is derived from the ring number. Details of the shared memory read access 222, according to one exemplary implementation, are as shown in
All activities but the credits update (at 202) are performed by the ring manager 30 in the processor 12. Similar activities are performed by the ring manager software running on the GPP 14.
Before a given ring can be used, it must be initialized. Initialization is handled by software executing on the agents, e.g., the processor's ring users 31 or the GPP 14. Referring to
As indicated above, multiple producers and/or consumers may operate on ring data from the same processor. For example, multiple processor cores may wish to write/read data to/from a ring. A variety of implementations can handle multiple producers/consumers. For example, a processor may feature a single ring producer and/or consumer agent that handles ring requests received from other agents; the different processor cores may funnel ring access requests into a queue serviced by a ring producer thread operating on one of the cores. In such an implementation, the single processor ring producer and/or consumer agent may have exclusive control of the data in the private producer and/or consumer data structures.
Alternately, the different producers and consumers may independently access the ring data structures of a given ring. In such an implementation, mechanisms (e.g., a mutual-exclusion lock (“mutex”)) may be used to resolve contention issues between the agents. For example, the head and tail descriptors may be protected by mutual-exclusion (mutex) locks that restrict access to the descriptors to one respective consumer or producer agent at a time. Alternately, mutexes may be used at finer granularity. For instance, one mutex may lock the private consumer pointer while another locks the private consumer credit count. Additionally, the multiple agents may maintain their own credit pools that they contribute to/take from the private producer/consumer credit pools.
To illustrate operation of an implementation featuring multiple independent producers, the producers can write ring entries by, for example, acquiring a mutex to the private producer pointer; obtaining the necessary credits from the producer credit count; writing the ring entries; updating the private producer pointer; and releasing the mutex. Potentially, the producer may adjust the shared producer pointer and/or the shared credit count which may also entail acquiring respective mutexes. Likewise, multiple independent consumers can dequeue ring entries by, for example, acquiring a mutex to the private consumer pointer; dequeuing ring entries; updating the private consumer pointer; and releasing the mutex. Potentially, a given consumer may update the private consumer credit pool and/or the public credit pool. Again, this updating may require acquisition of one or more mutexes.
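The following sketch, with hypothetical names, illustrates the multi-producer enqueue sequence just described: acquire the mutex guarding the private producer pointer and credits, spend credits, write the entries, advance the pointer, and release. Updates to the shared producer pointer and the public credit pool are omitted.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

struct shared_ring {
    uint32_t       *entries;        /* ring array in shared memory */
    uint32_t        size;           /* power of two */
    uint32_t        tail;           /* private producer pointer */
    unsigned        credits;        /* producer credit count */
    pthread_mutex_t lock;           /* guards tail and credits */
};

static bool multi_produce(struct shared_ring *r,
                          const uint32_t *msg, uint32_t n)
{
    pthread_mutex_lock(&r->lock);
    if (r->credits < n) {           /* not enough free entries known */
        pthread_mutex_unlock(&r->lock);
        return false;
    }
    r->credits -= n;
    for (uint32_t i = 0; i < n; i++)
        r->entries[(r->tail + i) & (r->size - 1)] = msg[i];
    r->tail += n;
    pthread_mutex_unlock(&r->lock);
    return true;
}
```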
The multi-processor system 10 (of
The line card is where line termination and I/O processing occurs. It may include processing in the data plane (packet processing) as well as control plane processing to handle the management of policies for execution in the data plane. The blades 272a-272m may include: control blades to handle control plane functions not distributed to line cards; control blades to perform system management functions such as driver enumeration, route table management, global table management, network address translation and messaging to a control blade; applications and service blades; and content processing blades. The switch fabric or fabrics may also reside on one or more blades. In a network infrastructure, content processing may be used to handle intensive content-based processing outside the capabilities of the standard line card functionality including voice processing, encryption offload and intrusion-detection where performance demands are high.
At least one of the line cards, e.g., line card 274a, is a specialized line card that is implemented based on the architecture of system 10, to tightly couple the processing intelligence of a general purpose processor to the more specialized capabilities of a network processor. The line card 274a includes media interfaces 278 to handle communications over network connections. Each media interface 278 is connected to a processor 12, shown here as network processor (NP) 12. In this implementation, one NP is used as an ingress processor and the other NP is used as an egress processor, although a single NP could also be used. Other components and interconnections in system 10 are as shown in
The techniques described above may be implemented in a variety of logic. The term logic as used herein includes hardwired circuitry, digital circuitry, analog circuitry, programmable circuitry, and so forth. The programmable circuitry may operate on instructions disposed on an article of manufacture (e.g., a volatile or non-volatile memory device).
Other embodiments are within the scope of the following claims.