The field of invention relates generally to network equipment and, more specifically but not exclusively, relates to a scalable, high-performance interconnect scheme for a multi-threaded, multi-processing system-on-a-chip device, such as a network processor unit.
Network devices, such as switches and routers, are designed to forward network traffic, in the form of packets, at high line rates. One of the most important considerations for handling network traffic is packet throughput. To achieve high throughput, special-purpose processors known as network processors have been developed to efficiently process very large numbers of packets per second. In order to process a packet, the network processor (and/or network equipment employing the network processor) needs to extract data from the packet header indicating the destination of the packet, class of service, etc., store the payload data in memory, perform packet classification and queuing operations, determine the next hop for the packet, select an appropriate network port via which to forward the packet, etc. These operations are generally referred to as “packet processing” operations.
Modern network processors perform packet processing using multiple multi-threaded processing elements (referred to as microengines in network processors manufactured by Intel® Corporation, Santa Clara, Calif.), wherein each thread performs a specific task or set of tasks in a pipelined architecture. During packet processing, numerous accesses are performed to move data between various shared resources coupled to and/or provided by a network processor. For example, network processors commonly store packet metadata and the like in static random access memory (SRAM) stores, while storing packets (or packet payload data) in dynamic random access memory (DRAM)-based stores. In addition, a network processor may be coupled to cryptographic processors, hash units, general-purpose processors, and expansion buses, such as the PCI (peripheral component interconnect) and PCI Express buses.
In general, the various packet-processing elements (e.g., microengines) of a network processor, as well as other optional processing elements, such as general-purpose processors, will share access to various system resources. Such shared resources typically include data storage and processing units, such as memory stores (e.g., SRAM, DRAM), hash units, cryptography units, etc., and input/output (I/O) interfaces. The shared resources and their consumers are interconnected via sets of buses known as the “chassis.” The chassis is a high-performance interconnect on the network processor chip that provides the on-chip data transport infrastructure between numerous processing elements on the chip and the numerous shared resources on-chip or accessible via appropriate built-in chip interfaces.
Under typical network processor configurations, various bus schemes are employed to enable shared access to the shared resources. Since only a single set of signals can be present on a given bus at any point in time, buses require multiplexing and the like to allow multiple resource consumers to access multiple resource targets coupled to the bus. One technique for relieving access contention is to provide separate buses for data reads and data writes for each target. As used herein, these buses are known as push buses (for reads) and pull buses (for writes). (It is noted that the terms push and pull are from the perspective of the shared resource target.) However, implementing separate buses for reads and writes for each target increases the bus count, and thus adds to the already crowded signal routing requirements for the network processor chip. For example, under a conventional approach, sharing access to 16 shared resources requires 16 independent sets of buses, with each set including a read bus, a write bus, and a command bus, for a total of 48 buses. To support routing for such a large number of buses, die sizes must be increased; this directly conflicts with the goal of reducing die sizes and processor costs.
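The bus-count arithmetic for the conventional approach can be checked with a minimal sketch. The three bus types per target come from the text; the function and constant names are hypothetical and used only for illustration.

```python
# Bus types that the conventional approach dedicates to every target
# (taken from the text). The names below are illustrative assumptions.
BUSES_PER_TARGET = ("read", "write", "command")


def conventional_bus_count(num_targets: int) -> int:
    """Total buses when each shared resource gets its own dedicated bus set."""
    return num_targets * len(BUSES_PER_TARGET)


print(conventional_bus_count(16))  # 48
```

The count grows linearly with the number of shared resources, which is the routing pressure the passage above describes.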
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
a is a schematic diagram illustrating details of a north command bus, according to one embodiment of the invention;
b is a schematic diagram illustrating details of a south command bus, according to one embodiment of the invention;
a is a schematic diagram illustrating details of a north pull data bus, according to one embodiment of the invention;
b is a schematic diagram illustrating details of a south pull data bus, according to one embodiment of the invention;
a is a schematic diagram illustrating details of a north pull requester identifier (ID) bus, according to one embodiment of the invention;
b is a schematic diagram illustrating details of a south pull ID bus, according to one embodiment of the invention;
a is a schematic diagram illustrating details of a north push data bus, according to one embodiment of the invention;
b is a schematic diagram illustrating details of a south push data bus, according to one embodiment of the invention;
a is a schematic diagram illustrating details of a north push ID bus, according to one embodiment of the invention;
b is a schematic diagram illustrating details of a south push ID bus, according to one embodiment of the invention;
a is a schematic diagram showing further details of the network processor architecture of
b is a schematic diagram showing further details of the network processor architecture of
Embodiments of a scalable, high-performance interconnect scheme for a multi-threaded, multi-processing system-on-a-chip network processor unit are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
According to one aspect of the embodiments described herein, a scalable chassis interconnect infrastructure based on principles of a cross-bar architecture is implemented to enable access to a large number of shared resources without requiring individual bus sets for each shared resource. The chassis supports transactions between two types of agents: masters and targets. In one embodiment, the masters are organized into groups (“clusters”) that share common bus infrastructure. The chassis also comprises a set of high performance buses, including command buses that move commands from masters to targets, and respective sets of push and pull data and request identifier (ID) buses.
The embodiment of
In general, architecture 100 represents a logic architecture, wherein the physical location of the various elements may vary from where they are shown in the Figures herein. However, in one embodiment the general location of the targets and masters on a physical device are similar to that shown in
Architecture 100 includes two sets of buses connecting the clusters 102A-D to the various shared resource targets. In one embodiment, each set of buses includes a command bus and two sets of data buses—a push bus for read data, and a pull bus for write data. Thus, each cluster has two independent command buses and two sets of data buses. Additionally, in one embodiment the sets of buses further include associated tag buses (ID buses) for assisting transfer of data to/from the masters and targets.
The buses employed for the north targets located at the top of architecture 100 include a north command bus 140, a north pull data bus 142, a north pull ID bus 144, a north push data bus 146, and a north push ID bus 148. The buses employed for the south targets located at the bottom of architecture 100 include a south command bus 150, a south pull data bus 152, a south pull ID bus 154, a south push data bus 156, and a south push ID bus 158.
The north command bus 140 circuitry includes a set of horizontal bus lines 200, including bus lines 200A, 200B, 200C, and 200D. It will be understood that each bus line represents a set of signal lines corresponding to a respective bus, rather than a single signal. The width of each bus is dependent on the particular requirements of the network processor implementation. Respective sets of cross-bar buses (depicted as bus lines) are coupled between horizontal bus line set 200 and a respective target via a respective command multiplexer. The cross-bar buses include cross-bar buses 210, 212, 214, 216, 218, and 220, while the multiplexers include command (CMD) multiplexers 222, 224, 226, 228, 230, and 232. Additionally, command multiplexers 234 and 236 are connected at opposite ends of horizontal bus line set 200.
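The bus inventory can be summarized as a small lookup table. This is only an illustrative sketch: the reference numerals and bus kinds come from the description, while the `Bus` type, the `CHASSIS_BUSES` list, and the `buses_for` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bus:
    region: str  # "north" or "south" target group served by the bus
    kind: str    # role of the bus within its set
    ref: int     # reference numeral used in the description


# Hypothetical table; regions, kinds, and numerals follow the text.
CHASSIS_BUSES: List[Bus] = [
    Bus("north", "command", 140),
    Bus("north", "pull data", 142),
    Bus("north", "pull ID", 144),
    Bus("north", "push data", 146),
    Bus("north", "push ID", 148),
    Bus("south", "command", 150),
    Bus("south", "pull data", 152),
    Bus("south", "pull ID", 154),
    Bus("south", "push data", 156),
    Bus("south", "push ID", 158),
]


def buses_for(region: str) -> List[Bus]:
    """Return the buses that serve one target region."""
    return [bus for bus in CHASSIS_BUSES if bus.region == region]


for bus in buses_for("north"):
    print(bus.ref, bus.kind)
```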
In one embodiment, the number of buses (depicted as bus lines) in a bus line set is equal to the number of clusters in the architecture. For example, in the figures illustrated herein, the network processor architecture includes four clusters. Thus, the number of bus lines depicted for each bus line set is four, indicating there would be four sets of buses. In other embodiments (not shown), the network processor architecture may include other numbers of clusters, such as six, for example. In this case, each bus line set would include six bus lines depicting six sets of buses. In general, the number of bus lines (and thus buses) in a cross-bar bus is equal to the number of bus lines in the horizontal bus to which it is coupled.
The cross-bar bus architecture of north command bus 140 supports a two-stage arbitration scheme. The arbitration scheme is used to selectively connect a given master to a designated target to enable a command to be sent from the master to the target. The first stage is used to select a master from a given cluster, while the second stage is used to select the designated target from among all of the north targets. The outcome of the first arbitration stage, also referred to herein as intra-cluster arbitration, for each of clusters 102A, 102B, 102C, and 102D is depicted as respective OR gates 240A, 240B, 240C, and 240D. The OR gate representation is used to indicate that if any of the masters for a given cluster initiates a target transaction request, the process for arbitrating a request for the entire cluster is initiated. The north target command output stage of each cluster is connected to a corresponding bus line in bus line set 200 via a respective cross-bar bus. These include cross-bar buses 242A, 242B, 242C, and 242D.
In the illustrated embodiment, transaction requests are forwarded between masters in a given cluster using a pipelined scheme. This pipelined design takes advantage of the multithreaded approach to packet processing used by modern network processors. Thus, a target transaction request is passed from one master to the next master in the pipeline until it reaches the output stage for the cluster. Upon winning intra-cluster arbitration at the output stage and cluster arbitration (i.e., arbitration between concurrent requests issued from multiple clusters), a command is placed on the horizontal bus line corresponding to the cluster.
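The two arbitration stages for commands can be modeled in software with a minimal sketch. Several things here are assumptions rather than part of the description: the round-robin policy (the text does not name an arbitration policy), the target names `sram0` and `dram0`, the figure of four masters per cluster, and all class and function names.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Command(NamedTuple):
    cluster: int  # index of the issuing cluster
    master: int   # index of the master within its cluster
    target: str   # hypothetical name of the destination target


class RoundRobin:
    """Round-robin picker (assumed policy; the text does not specify one)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.last = size - 1  # index 0 gets first priority

    def pick(self, requesting: Set[int]) -> Optional[int]:
        for step in range(1, self.size + 1):
            candidate = (self.last + step) % self.size
            if candidate in requesting:
                self.last = candidate
                return candidate
        return None


def arbitrate(pending: List[Command],
              intra_cluster: Dict[int, RoundRobin],
              per_target: Dict[str, RoundRobin]) -> Dict[str, Command]:
    # Stage 1 (intra-cluster): each cluster places at most one command
    # on its own horizontal bus line.
    bus_lines: Dict[int, Command] = {}
    for cluster, arbiter in intra_cluster.items():
        by_master = {c.master: c for c in pending if c.cluster == cluster}
        winner = arbiter.pick(set(by_master))
        if winner is not None:
            bus_lines[cluster] = by_master[winner]
    # Stage 2: each target's command multiplexer selects one bus line
    # among those carrying a command addressed to that target.
    granted: Dict[str, Command] = {}
    for target, arbiter in per_target.items():
        contenders = {cl for cl, cmd in bus_lines.items() if cmd.target == target}
        winner = arbiter.pick(contenders)
        if winner is not None:
            granted[target] = bus_lines[winner]
    return granted


intra = {cluster: RoundRobin(4) for cluster in range(4)}
targets = {"sram0": RoundRobin(4), "dram0": RoundRobin(4)}
pending = [Command(0, 1, "sram0"), Command(0, 2, "dram0"), Command(1, 0, "sram0")]
granted = arbitrate(pending, intra, targets)
print(granted)  # only sram0 is granted, to cluster 0, master 1
```

In this model, a command that loses either stage simply stays pending and contends again in the next round.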
Another concept illustrated in
As illustrated in
As discussed above, the north and south command bus architectures enable any master to access any target, regardless of whether that target is a north target or a south target. This also applies to masters that operate as targets, even when the master/target is not in the same (north or south) region as the target to which the command is routed via the combined north and south command bus architecture. For example, a component operating as both a north target and a master may access a south target.
In one embodiment, respective intra-cluster arbitration operations for commands issued by masters that are members of a given cluster (or by external master/target components that are associated with that cluster) are performed for each of the north and south target groups. The result of the intra-cluster arbitration for the south target commands is depicted by OR gates 241A, 241B, 241C, and 241D in
Details of north pull data bus 142, according to one embodiment, are illustrated in
A key challenge in such a high performance interconnect scheme is to achieve high utilization and bounded average latency on the buses such that the delivered bandwidth on the interconnect tracks very closely to the peak bandwidth provisioned for the interconnect. As discussed above, due to die-size limitations, the number of targets typically exceeds the number of available intra-chip data routing buses. According to one embodiment, this issue is addressed by grouping multiple targets into sub-groups that share a common data-bus track.
Accordingly, in the illustrated embodiment of
As illustrated in
In one embodiment, respective intra-cluster arbitration operations corresponding to pull data transaction for masters that are members of a given cluster (or by external master/target components that are associated with that cluster) are performed for each of the north and south target groups. The result of the intra-cluster arbitration for the south target pull data bus is depicted by OR gates 341A, 341B, 341C, and 341D in
In another embodiment, the north pull data bus has a configuration analogous to the north command bus, while the south pull data bus has a configuration analogous to the south command bus (both not shown). In these embodiments, the north and south target groups are not further grouped into sub-groups.
Exemplary embodiments of north and south pull ID buses 144 and 154 are shown in
Each of the north and south pull ID buses employs a two-stage arbitration scheme, including a sub-group arbitration stage and a cluster arbitration stage. The sub-group arbitration stage is used to determine which member of the group is allowed access to the pull ID bus. The cluster arbitration stage is used to determine which winning sub-group is allowed access to the cluster to which the master half of the transaction belongs.
In connection with the first-stage arbitration operations, a respective sub-group ID multiplexer is provided for each sub-group. These include sub-group ID multiplexers 402₀, 402₁, 402₂, and 402₃ for the north target sub-groups, and sub-group ID multiplexers 452₀, 452₁, and 452₃ for the south target sub-groups. In connection with the second-stage arbitration operations, a respective sub-group selection ID multiplexer is provided to connect a winning sub-group to a corresponding cluster. These include sub-group selection ID multiplexers 410A, 410B, 410C, and 410D for the north target sub-groups, and sub-group selection ID multiplexers 460A, 460B, 460C, and 460D for the south target sub-groups.
Each of the horizontal bus lines is connected to each of the sub-group selection ID multiplexers via respective sets of cross-bar bus lines. These include cross-bar bus line sets 412A, 412B, 412C, and 412D for horizontal bus line set 400 (
Exemplary embodiments of north and south push data buses 146 and 156 are shown in
Each of the north and south push data buses employs a two-stage arbitration scheme, including a sub-group arbitration stage and a cluster arbitration stage. The sub-group arbitration stage is used to determine which member of the group is allowed access to the push data bus, while the cluster arbitration stage is used to determine which winning sub-group is allowed access to the cluster to which the master half of the transaction belongs.
In connection with the first-stage arbitration operations, a respective sub-group data multiplexer is provided for each sub-group. These include sub-group data multiplexers 502₀, 502₁, 502₂, and 502₃ for the north target sub-groups, and sub-group data multiplexers 552₀, 552₁, and 552₃ for the south target sub-groups. In connection with the second-stage arbitration operations, a respective sub-group selection data multiplexer is provided to connect a winning sub-group to a corresponding cluster. These include sub-group selection data multiplexers 510A, 510B, 510C, and 510D for the north target sub-groups, and sub-group selection data multiplexers 560A, 560B, 560C, and 560D for the south target sub-groups.
Each of the horizontal bus lines is connected to each of the sub-group selection data multiplexers via respective sets of cross-bar bus lines. These include cross-bar bus line sets 512A, 512B, 512C, and 512D for horizontal bus line set 500 (
Exemplary embodiments of north and south push ID buses 148 and 158 are shown in
Each of the north and south push ID buses employs a two-stage arbitration scheme, including a sub-group arbitration stage and a cluster arbitration stage. The sub-group arbitration stage is used to determine which member of the group is allowed access to the push ID bus. The cluster arbitration stage is used to determine which winning sub-group is allowed access to the cluster to which the master half of the transaction belongs.
In connection with the first-stage arbitration operations, a respective sub-group ID multiplexer is provided for each sub-group. These include sub-group ID multiplexers 602₀, 602₁, 602₂, and 602₃ for the north target sub-groups, and sub-group ID multiplexers 652₀, 652₁, and 652₃ for the south target sub-groups. In connection with the second-stage arbitration operations, a respective sub-group selection ID multiplexer is provided to connect a winning sub-group to a corresponding cluster. These include sub-group selection ID multiplexers 610A, 610B, 610C, and 610D for the north target sub-groups, and sub-group selection ID multiplexers 660A, 660B, 660C, and 660D for the south target sub-groups.
Each of the horizontal bus lines is connected to each of the sub-group selection ID multiplexers via respective sets of cross-bar bus lines. These include cross-bar bus line sets 612A, 612B, 612C, and 612D for horizontal bus line set 600 (
During packet processing operations, various transaction requests will be made by various masters to various targets. For the following examples, read transaction requests and subsequent processing of granted requests are considered. The arbitration operations that support read transaction requests are referred to herein as “Push” arbitration, and the requests are known as “push-data” (target read) transaction requests. The operations described below to support push-data transactions are generally illustrative of analogous “pull-data” (target write) transactions, except that the pull data and pull ID buses have different configurations, as discussed above and shown herein.
In an aspect of some embodiments, bus contention issues are resolved using a two-stage arbitration scheme, wherein the first stage arbitrates between sub-groups and the second stage arbitrates between processing element clusters. For example, details of an embodiment of an exemplary two-stage arbitration scheme employed for the north push data and ID buses are shown in
The north push data bus arbitration scheme of
b shows further details of the push ID bus infrastructure of one embodiment of network processor architecture 100. As with the push data bus, a respective first-stage (sub-group) arbiter 700 (e.g., SG0, SG1, SG2 and SG3) is used to control the operation of sub-group selection ID multiplexers 6021-4. Similarly, a respective second stage (cluster) arbiter 702 (e.g., C0, C1, C2, and C3) is used to control operation of sub-group selection ID multiplexers 610A-D.
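The interaction between the sub-group arbiters (SG0 through SG3) and the cluster arbiters (C0 through C3) can be sketched as follows. This is a simplified software model, not the described hardware: the fixed-priority policy, the target names `t0` through `t6`, and the sub-group membership are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sub-group membership; the arbiter labels follow the text.
SUB_GROUPS: Dict[str, List[str]] = {
    "SG0": ["t0", "t1"],
    "SG1": ["t2", "t3"],
    "SG2": ["t4"],
    "SG3": ["t5", "t6"],
}
CLUSTERS = ["C0", "C1", "C2", "C3"]


def push_arbitrate(requests: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """requests maps a target name to the cluster awaiting its read data."""
    # Stage 1 (sub-group arbiters): one member of each sub-group wins
    # access to the sub-group's shared bus. Assumed policy: list order.
    stage1: Dict[str, str] = {}
    for sub_group, members in SUB_GROUPS.items():
        for target in members:
            if target in requests:
                stage1[sub_group] = target
                break
    # Stage 2 (cluster arbiters): for each cluster, one winning sub-group
    # is connected to that cluster. Assumed policy: SG0 has top priority.
    grants: Dict[str, Tuple[str, str]] = {}
    for cluster in CLUSTERS:
        for sub_group, target in stage1.items():
            if requests[target] == cluster:
                grants[cluster] = (sub_group, target)
                break
    return grants


requests = {"t0": "C1", "t1": "C1", "t3": "C1", "t5": "C2"}
print(push_arbitrate(requests))
# {'C1': ('SG0', 't0'), 'C2': ('SG3', 't5')}
```

In this example `t1` loses the sub-group stage to `t0`, and `t3` wins its sub-group but loses the cluster stage because cluster C1 is already connected to SG0; both would retry in a later round.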
Network processors that implement architecture 100 of
In the illustrated embodiment of
As discussed above, the crossbar chassis configuration of network processor architecture 100 enables various masters (e.g., microengines 104) to access various targets via corresponding transaction requests. In some embodiments, some of the actual data or operations supported by the “effective” targets are provided on the network processor chip, while others are provided off-chip. For example, while an NRAM control channel may comprise a target (for the purpose of the foregoing discussion), the effective target is the actual NRAM store that is accessed via the NRAM control channel (since the control channel does not store any data by itself).
In the exemplary configuration shown in
Network devices are used to perform packet-processing operations. One of the primary functions performed during packet processing is determining the next hop to which the packet is to be forwarded. A typical network device, such as a switch, includes multiple input and output ports. More accurately, the switch includes multiple input/output (I/O) ports, each of which may function as either an input or an output port within the context of forwarding a given packet. An incoming packet is received at a given I/O port (that functions as an input port), the packet is processed, and the packet is forwarded to its next hop via an appropriate I/O port (that functions as an output port). The switch includes a plurality of cross-connects known as the media switch fabric. The switch fabric connects each I/O port to the other I/O ports. Thus, a switch is enabled to route a packet received at a given I/O port to any of the next hops coupled to the other I/O ports for the switch.
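The receive, process, and forward flow can be illustrated with a toy model. Everything in it is a hypothetical simplification: the `Packet` and `Switch` types, the exact-match next-hop table, and the example addresses are assumptions, and real devices perform far more elaborate classification.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Packet:
    dest: str       # destination taken from the packet header
    payload: bytes


class Switch:
    """Toy switch: any I/O port can act as input or output for a packet."""

    def __init__(self, num_ports: int, next_hop_port: Dict[str, int]) -> None:
        self.num_ports = num_ports
        self.next_hop_port = next_hop_port  # destination -> output port

    def forward(self, in_port: int, packet: Packet) -> Optional[int]:
        """Return the output port for the packet, or None if no route exists."""
        if not 0 <= in_port < self.num_ports:
            raise ValueError(f"no such I/O port: {in_port}")
        return self.next_hop_port.get(packet.dest)


switch = Switch(num_ports=4, next_hop_port={"10.0.0.2": 3, "10.0.0.9": 1})
print(switch.forward(0, Packet("10.0.0.2", b"data")))  # 3
```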
The exemplary network device of
In general, aspects of the foregoing embodiments may be implemented in programmed logic using semiconductor fabrication techniques. In addition, embodiments of the present description may be implemented within machine-readable media. For example, the designs described above may be stored upon and/or embedded within machine-readable media associated with a design tool used for designing semiconductor devices. Examples include a netlist formatted in the VHSIC Hardware Description Language (VHDL), the Verilog language, or the SPICE language. Some netlist examples include a behavioral level netlist, a register transfer level (RTL) netlist, a gate level netlist, and a transistor level netlist. Machine-readable media also include media having layout information, such as a GDS-II file. Furthermore, netlist files or other machine-readable media for semiconductor chip design may be used in a simulation environment to perform the methods of the teachings described above.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.