This disclosure relates to electronic assemblies and communication within electronic assemblies.
High performance computing systems are important for many applications. However, conventional computing system designs can encounter significant communication latency in on-chip networks, leading to decreased performance.
The innovations described in the claims each have several aspects, no single one of which is solely responsible for its desirable attributes. Without limiting the scope of the claims, some prominent features of this disclosure will now be briefly described.
In some aspects, the techniques described herein relate to a method of routing a packet in a computing system, the method including: outputting a first bypass signal and a second bypass signal from a first computing node of an array of computing nodes, wherein the first bypass signal indicates to route a packet through a second computing node of the array of computing nodes, and wherein the second bypass signal indicates to turn the packet in a third computing node of the array of computing nodes; routing the packet through the second computing node based on the first bypass signal from the first computing node, wherein the packet is routed from the first computing node through the second computing node in a single clock cycle, and wherein the second computing node receives the first bypass signal by way of a faster route than the second computing node receives the packet; and turning the packet in the third computing node based on the second bypass signal, wherein the packet is received by the third computing node from the second computing node.
In some aspects, the techniques described herein relate to a method, wherein the third computing node receives a third bypass signal that is based on the second bypass signal by way of a faster route than the third computing node receives the packet.
In some aspects, the techniques described herein relate to a method, wherein the packet is routed through the third computing node in two clock cycles.
In some aspects, the techniques described herein relate to a method, wherein the packet includes a header portion and a data portion, and the header portion is routed one cycle ahead of the data portion.
In some aspects, the techniques described herein relate to a method, wherein routing the packet through the second computing node includes: routing the header portion in a first clock cycle; and routing the data portion in a second clock cycle.
In some aspects, the techniques described herein relate to a method, wherein routing the packet through the second computing node includes: storing the first bypass signal in a state element of the second computing node; routing the header from the first computing node to the second computing node based at least in part on the first bypass signal; and after routing the header from the first computing node to the second computing node, routing the data portion from the first computing node to the second computing node based at least in part on the first bypass signal.
In some aspects, the techniques described herein relate to a method, wherein the packet includes a plurality of sub-packets, each sub-packet includes a header and a data portion, and said routing the packet through the second computing node includes: routing the plurality of sub-packets from the first computing node to the second computing node; and comparing at least a portion of each header of each of the plurality of sub-packets.
In some aspects, the techniques described herein relate to a method, further including: determining that there is a header mismatch based on said comparing; and providing an error signal responsive to said determining.
In some aspects, the techniques described herein relate to a method, wherein routing the packet through the second computing node is further based on one or more other packets waiting to exit the second computing node and an available capacity of a destination queue of the packet.
In some aspects, the techniques described herein relate to a method, further including outputting a third bypass signal from the second computing node, wherein the third bypass signal indicates to route another packet through a fourth computing node of the array of computing nodes.
In some aspects, the techniques described herein relate to a method, wherein when the first bypass signal indicates that the packet can bypass the second computing node, routing the packet from the first computing node to the second computing node includes routing the packet on a connection that does not allow the packet to turn at the second computing node.
In some aspects, the techniques described herein relate to a computing system including: a first computing node; and a second computing node, wherein the first and second computing nodes are included in a computing node array, and wherein the first computing node is configured to route a bypass signal on a first route to the second computing node and to route packet data to the second computing node on a second route, wherein the first route is faster than the second route, and wherein the bypass signal is indicative of whether to turn the packet data in the second computing node.
In some aspects, the techniques described herein relate to a computing system, further including a third computing node, wherein the first, second, and third computing nodes are included in a same row or column of the computing node array, and wherein the first computing node is configured to output a second bypass signal indicative of whether to turn the packet data at the third computing node.
In some aspects, the techniques described herein relate to a computing system, wherein the third computing node is configured to turn the packet and output the packet in two clock cycles.
In some aspects, the techniques described herein relate to a computing system, wherein the packet includes a header and a data portion, and the second computing node is configured to route the header to the third computing node at least one clock cycle before routing the data portion to the third computing node.
In some aspects, the techniques described herein relate to a computing system, wherein the packet includes a plurality of sub-packets, each sub-packet includes a header and a data portion, and the second computing node is configured to compare at least a portion of the header of each sub-packet.
In some aspects, the techniques described herein relate to a computing system, wherein the computing system is configured to route the packet through the second computing node in a path between the first computing node and the third computing node in a single clock cycle.
In some aspects, the techniques described herein relate to a computing system, wherein the computing system is configured to perform neural network training.
In some aspects, the techniques described herein relate to a computing system, wherein a system on a wafer includes the computing node array.
In some aspects, the techniques described herein relate to a computing system, wherein the computing system is configured to determine the first route based at least partly on at least one of a number of other packets waiting to exit the second computing node or an available capacity of a destination queue for the packet.
This disclosure is described herein with reference to drawings of certain embodiments, which are intended to illustrate, but not to limit, the present disclosure. It is to be understood that the accompanying drawings, which are incorporated into and constitute a part of this specification, are for the purpose of illustrating concepts disclosed herein and may not be to scale.
The following description of certain embodiments presents various descriptions of specific embodiments. However, the innovations described herein may be embodied in a multitude of different ways, for example, as defined and covered by the claims. In this description, reference is made to the drawings where like reference numerals may indicate identical or functionally similar elements. It will be understood that elements illustrated in the figures are not necessarily drawn to scale. Moreover, it will be understood that certain embodiments may include more elements than illustrated in a drawing and/or a subset of the elements illustrated in a drawing. Further, some embodiments may incorporate any suitable combination of features from two or more drawings.
The computing nodes 102 of the array 100 can interface with each other to implement distributed computing functionality. In some embodiments, each computing node of the array 100 can execute computing operations that can include one or more of computation, storage, routing determinations, external communications, and so forth. In some embodiments, each computing node in the plurality of computing nodes 102 can be an instance of the same design. However, in some embodiments, an array can include two or more types of nodes with different capabilities, such as different routing capabilities, different computing capabilities (including, for example, no computing capabilities), different amounts of memory (e.g., static random access memory (SRAM)), different sensors (e.g., temperature, voltage, etc.), and so forth. In certain applications, the array 100 can be implemented on a system on a wafer.
In a multi-computing node network, for example as shown in
As a packet travels across the computing node, a network routing determination can be made regarding whether to route the packet straight, to turn the packet, or to terminate the packet because it has reached its destination. If a system waits for the packet to arrive at a computing node before making a routing decision regarding the routing path of the packet from the computing node, then the system may not be able to accomplish both receipt of the packet and the routing decision within a single cycle. More specifically, both transporting the packet across a computing node and determining where to route it next can be difficult to accomplish in a single cycle without making computing node sizes smaller than desired. Accordingly, such approaches can be inefficient and can incur significant packet communication latency.
Embodiments of this disclosure can address inefficiencies with packet routing. In some applications, the width, height, or both of an on-chip network can be selected based at least in part on the time it takes a packet to travel on an average global wire, where a global wire can route signals between computing nodes. In some embodiments, a system can include a number of wider and/or thicker wires that can be used for carrying critical signals. For example, the wider or thicker wires can carry valid bits, a field indicating which virtual channel a packet is traveling in, and so forth. In some embodiments, there can be greater space between the wider or thicker wires to reduce coupling between wires. A thicker or wider wire can, in some cases, transport information more quickly than regular wires. However, only a limited number of such wires may be available. Such wires may take up significantly more space than a regular wire, for example as much space as about 3, about 4, or about 5 regular wires. The wider or thicker wires can be in a higher level metal layer than narrower wires. In some embodiments, as a packet enters a computing node array, a processing routine can conduct a lookup in a routing table to determine which computing node row and column the packet should turn in. For packets that are traveling within a die, a row/column identifier field or the like can be used directly, without a routing table, to determine where a packet can turn. In some embodiments, the processing routine can determine if the packet, after turning, will terminate at a different computing node or continue off the edge of the die.
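The routing-table lookup described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation; the destination identifiers, table contents, and function name are all hypothetical.

```python
# Hypothetical routing-table lookup performed as a packet enters the array:
# a destination identifier maps to the (row, column) at which the packet
# should turn. Table contents are illustrative only.
ROUTING_TABLE = {
    0x2A: (3, 15),  # destination id -> (turn row, turn column)
    0x2B: (7, 2),
}

def lookup_turn(dest_id: int) -> tuple[int, int]:
    """Return the (row, column) where a packet for dest_id should turn."""
    return ROUTING_TABLE[dest_id]
```

In this sketch, the lookup happens once at array entry; nodes along the path then only need the resulting turn coordinates, consistent with the row/column identifier field mentioned above for intra-die traffic.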
In some embodiments, for individual computing nodes, the processing routine can determine (e.g., decode) whether a packet should turn at a computing node that is two network hops away. For example, if a packet is traveling horizontally and should turn at column 15, the system can be configured to determine this turn when the packet is at a computing node in column 13. This determination can be used to generate a bypass eligible signal. The bypass eligible signal can be communicated over a faster route (e.g., a thicker and/or wider wire) so that the decode bypass eligible determination and the transport of the packet across a computing node can be performed in a single clock cycle. For example, the processing routine can conduct a bypass eligibility determination at each computing node, such that the determination can occur in time to allow the packet to turn at the correct location.
In some embodiments, the bypass eligible signal can be carried on a wider or thicker wire as the packet leaves a neighboring computing node. For example, with continued reference to the example above, the bypass eligible signal can be carried on a wider or thicker wire as the packet leaves computing node 14. Thus, the control signal can arrive before the packet at column 15 and can be used to steer the packet's data.
In some embodiments, a packet can have two indicators related to bypassing computing nodes (e.g., whether to route through a computing node without turning). A “bypass” (BYP) signal can indicate if the packet is permitted to bypass the next computing node, and a “bypass next” (BYP_NEXT) signal can indicate if the packet is permitted to bypass a computing node that is two hops away. When a packet reaches the next computing node, the BYP_NEXT value can become the new BYP value, and a new BYP_NEXT value can be determined. By determining whether to bypass and route through the next two computing nodes (e.g., whether the packet is turning at the next computing node or the computing node after the next computing node), there can be sufficient time to determine the route and send the packet while reducing wasted cycles. In principle, a different number of operations could be used. For example, bypass signals can be determined three hops away, four hops away, and so forth. In some embodiments, the control signals can be carried on faster wires while the data travels on regular, slower wires. The faster wires for routing such control signals can be implemented on higher-level metal layers than slower wires for routing packet data. For example, a semiconductor device made according to modern processes can include multiple metal layers, e.g., ten layers, fifteen layers, or some other number of layers. Lower metal layers typically can be narrower and thinner than higher metal layers to accommodate high density and typically carry signals over a relatively short range. Layers higher in the stack typically have thicker/wider wires to support global communication and efficient distribution of power and/or clock signals. In some embodiments, the top one or two layers can be used for carrying bypass signals, and the next one or two layers can be used for carrying the bulk of the packets from node to node.
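The two-hop decode described above can be expressed as a small sketch. This is an assumption-laden illustration of the BYP/BYP_NEXT idea (the function name and column arithmetic are hypothetical, and turns are assumed to occur at a known column as in the column-15 example above):

```python
def bypass_signals(col: int, turn_col: int) -> tuple[bool, bool]:
    """Compute (BYP, BYP_NEXT) for a packet currently at column `col`
    that should turn at column `turn_col`.

    BYP: may the packet bypass the next node (col + 1)?
    BYP_NEXT: may the packet bypass the node two hops away (col + 2)?
    """
    return (turn_col != col + 1, turn_col != col + 2)
```

For example, a packet at column 13 that should turn at column 15 may bypass column 14 (BYP is true) but not column 15 (BYP_NEXT is false), matching the two-hops-ahead determination described above.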
The number of hops to pre-determine can be based at least in part on the speed of the faster wires compared to the regular wires, the number of faster wires available, and so forth. For example, determining more hops in advance can allow more time for performing computations; a packet can thus be adaptively routed based on congestion rather than statically routed based on destination node address. However, determining bypass decisions one or more nodes in advance can place additional demands on the faster wires, which can have constrained capacity.
Each computing node N-2, N-1, N can receive and/or generate a bypass signal BYP. The bypass signal BYP is indicative of whether to continue routing the packet forward along a row or column. Bypass logic 205A, 205B, or 205C of a computing node can determine whether to route the packet forward based at least partly on the bypass signal BYP. When the bypass logic 205A, 205B, or 205C determines to route the packet forward, a select signal for a respective multiplexer 201A, 201B, or 201C can be asserted to select the packet. This can allow the packet to propagate along the same row or column on which the computing node received it. When the bypass logic 205A, 205B, or 205C determines to turn the packet, the packet can be stored by respective state elements 202D, 202E, or 202F. The packet can then be selected by asserting a select signal for a respective multiplexer 201D, 201E, or 201F in a following clock cycle to cause the packet to propagate outside the computing node on a route that is perpendicular to a route on which the computing node received the packet.
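The per-node datapath just described can be modeled behaviorally as follows. This is a simplified sketch under stated assumptions (one packet per cycle, a single turn direction); the class and method names are hypothetical and stand in for the multiplexers and state elements referenced above.

```python
class NodeDatapath:
    """Behavioral model of one computing node's bypass datapath.

    When BYP is asserted, the packet propagates straight in the same
    cycle; otherwise it is latched in a state element (analogous to
    202D/202E/202F) and exits on the perpendicular route a cycle later.
    """

    def __init__(self):
        self.turn_latch = None  # state element holding a turning packet

    def step(self, packet, byp: bool):
        """Advance one clock cycle; return (straight_out, turn_out)."""
        turn_out = self.turn_latch  # packet latched last cycle turns now
        self.turn_latch = None
        if byp and packet is not None:
            return packet, turn_out  # bypass: same row/column, same cycle
        self.turn_latch = packet     # store to turn in the next cycle
        return None, turn_out
```

A bypassing packet crosses the node in one cycle on the straight output, while a turning packet appears on the perpendicular output one cycle later, consistent with the single-cycle bypass and the extra cycle for turns described elsewhere herein.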
The BYP_NEXT value for computing node N-2 can be the BYP value for computing node N-1. The BYP_NEXT value for computing node N-1 can be determined by, for example, comparing the current computing node (e.g., N-2) to the computing node where a packet will turn (e.g., N). If the turning computing node (e.g., N) is two hops away from the current computing node (e.g., N-2), then the BYP_NEXT value for computing node N-1 can be set to a value indicating to turn at computing node N (e.g., a value of zero). Otherwise, BYP_NEXT for node N-1 can be set to a value indicating to route the packet forward at computing node N without turning (e.g., a value of one). Thus, for example, if an incoming packet to computing node N-2 should turn at computing node N, BYP_NEXT and BYP can both initially be set to a value indicating that it is okay to bypass node N-2 and to bypass node N-1. After computing node N-2, BYP can take on the previous value of BYP_NEXT for computing node N-1 (e.g., 1), indicating that node N-1 can be bypassed. A new BYP_NEXT can be computed and, in the current example of a packet that turns at computing node N, be set to zero. When the packet is at node N-1, the BYP value can be set to the previous value of BYP_NEXT (e.g., zero), indicating to turn the packet at computing node N and that the packet cannot bypass computing node N.
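The hop-by-hop shift described above, in which BYP_NEXT becomes the new BYP at each node, can be traced with a short sketch. The function name and trace format are hypothetical; the example mirrors a packet entering at one column and turning a few columns later.

```python
def simulate_byp(start_col: int, turn_col: int) -> list[tuple[int, bool]]:
    """Trace (column, BYP) for each node a packet crosses before turning.

    BYP at column c indicates whether the packet may bypass column c + 1;
    at each hop, the previous BYP_NEXT becomes the new BYP and a fresh
    BYP_NEXT is computed two hops ahead.
    """
    byp = turn_col != start_col + 1
    byp_next = turn_col != start_col + 2
    trace = [(start_col, byp)]
    for col in range(start_col + 1, turn_col):
        byp = byp_next                   # BYP_NEXT becomes the new BYP
        byp_next = turn_col != col + 2   # recompute two hops ahead
        trace.append((col, byp))
    return trace
```

For a packet entering at column 12 that turns at column 15, BYP stays true through columns 12 and 13 and becomes false at column 14, signaling the turn at column 15.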
As shown in
In some embodiments, a system can have a bypass control mechanism that can give priority to packets that are eligible to bypass a particular computing node, while still enabling other traffic to exit the computing node. In some embodiments, whether or not a packet bypasses a computing node can depend on more than whether or not the packet is eligible to bypass (e.g., whether or not BYP is yes). For instance, bypassing can depend on the number of packets waiting (e.g., packets ahead of an arriving packet that did not bypass or exit in a previous cycle, packets waiting to turn at the computing node, etc.), whether or not queues are full or near capacity, and so forth. For example, if a destination or intermediate queue that the packet will route to is full or near capacity, there may be little or no benefit to expediting the packet, and resources may instead be used for routing other packets.
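One possible form of the bypass-grant policy above is sketched below. This is an assumed policy, not the disclosed control logic: the function name, the zero-waiting condition, and the queue-capacity threshold are all illustrative.

```python
def grant_bypass(byp: bool, waiting_ahead: int, dest_queue_free: int) -> bool:
    """Decide whether an arriving packet actually takes the fast bypass.

    The packet must be bypass-eligible (BYP asserted), no earlier packets
    may be waiting to exit the node ahead of it, and the destination
    queue must have capacity to accept it.
    """
    return byp and waiting_ahead == 0 and dest_queue_free > 0
```

Under this sketch, an eligible packet is held back when older traffic is queued or when its destination queue is full, matching the observation above that expediting a packet toward a full queue yields little benefit.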
In some embodiments, a packet can have a smaller header portion (e.g., about 20 bits) and a larger data portion (e.g., about 200 bits, about 400 bits, about 800 bits, etc.). The header portion can be used for controlling the packet's path through the network. In some embodiments, the header can proceed through the network using the mechanisms described herein, and the rest of the packet (e.g., the data) can follow one cycle after the header. In some embodiments, the same signals that control the header can be stored in a state element, such as a flip flop, and in the following cycle can fan out to control the rest of the packet.
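The header-ahead pipelining above can be modeled with a two-cycle sketch. The class and method names are hypothetical; the point is only that the select derived from the bypass signal steers the small header in one cycle, is captured in a state element, and fans out to the wide data portion in the next cycle.

```python
class HeaderDataPipeline:
    """Model of a header routed one cycle ahead of its data portion."""

    def __init__(self):
        self.select_ff = False  # flip-flop holding last cycle's select

    def route_header(self, header, select: bool):
        """Cycle t: steer the header and latch the control signal."""
        self.select_ff = select
        return header if select else None

    def route_data(self, data):
        """Cycle t + 1: fan out the stored select to the data portion."""
        return data if self.select_ff else None
```

Because the data portion simply replays the latched control, the wide fanout does not need its own routing decision, which is the latency saving described above.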
In some embodiments, a packet can be large compared to other packets. For example, the packet can be a data packet that carries a relatively large amount of data. In some embodiments, the processing routine can divide the packet into multiple, smaller packets, for example two packets, three packets, four packets, and so forth. The processing routine can duplicate the header and send a fraction of the data (e.g., half for two packets, one fourth for four packets, and so forth) with each copy of the header. In some embodiments, the processing routine can be configured so that the headers travel through the system in lockstep, and thus the fanout to the data can also always happen within a single clock cycle. In some embodiments, the processing routine can conduct a parity check to verify that all copies of the data remain in lockstep.
If there are unexpected differences in the headers 712A-712D, this can indicate a problem in the transmission of the packets from the first computing node 702 to the second computing node 704. In some embodiments, the system can be configured to provide an error signal, to reboot, and/or to take other actions. In some embodiments, if there is an unexpected mismatch in the headers 712A-712D, the system can be configured to adjust one or more operating parameters. For example, the system can reduce an operating frequency, increase an operating voltage, and so forth.
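The sub-packet splitting and lockstep header comparison described in the preceding paragraphs can be sketched as follows. This is a simplified illustration (even division of the data is assumed, and the function names are hypothetical), not the disclosed hardware mechanism.

```python
def split_packet(header: bytes, data: bytes, n: int) -> list[tuple[bytes, bytes]]:
    """Duplicate the header across n sub-packets, each carrying 1/n of
    the data. Assumes len(data) is divisible by n for simplicity."""
    chunk = len(data) // n
    return [(header, data[i * chunk:(i + 1) * chunk]) for i in range(n)]

def headers_match(subpackets: list[tuple[bytes, bytes]]) -> bool:
    """Compare each duplicated header against the first; a mismatch
    would correspond to raising the error signal described above."""
    first = subpackets[0][0]
    return all(hdr == first for hdr, _ in subpackets)
```

In this sketch, the whole header is compared; as noted above, a real check might compare only a portion of each header, or a parity value, to bound the comparison cost.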
In some embodiments, when a packet turns, it can take one extra clock cycle to complete the turn. This can occur because, for example, the flip flops and logic for horizontal parts of the network and for vertical parts of the network often do not reside in the same (or adjacent) physical location on the die. Thus, in some embodiments, turning the packet can take two clock cycles, whereas a packet may be routed straight through a computing node in a single clock cycle. Accordingly, it may be advantageous to minimize the number of turns taken to route a packet from a source to a destination.
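The cycle counts above suggest a simple latency model, sketched below. The function name and cost model are illustrative assumptions (one cycle per straight hop, two cycles per turn), not figures from the disclosure.

```python
def route_latency(straight_hops: int, turns: int) -> int:
    """Estimate route latency in clock cycles: each straight pass through
    a node costs one cycle, and each turn costs two (one extra cycle for
    crossing between the horizontal and vertical network logic)."""
    return straight_hops + 2 * turns
```

Under this model, a five-hop straight route costs five cycles, while the same route with one turn costs seven, which is why minimizing turns can reduce end-to-end latency.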
The systems and methods herein can be used in a variety of processing systems for high performance computing and/or computation-intensive applications, such as neural network processing, neural network training, machine learning, artificial intelligence, and so forth. In some applications, the systems and methods described herein can be used in generating data for an autopilot system for a vehicle (e.g., an automobile), other autonomous vehicle functionality, and/or Advanced Driving Assistance System (ADAS) functionality.
In the foregoing specification, the systems and processes have been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the embodiments disclosed herein. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense.
Indeed, although the systems and processes have been disclosed in the context of certain embodiments and examples, it will be understood by those skilled in the art that the various embodiments of the systems and processes extend beyond the specifically disclosed embodiments to other alternative embodiments and/or uses of the systems and processes and obvious modifications and equivalents thereof. In addition, while several variations of the embodiments of the systems and processes have been shown and described in detail, other modifications, which are within the scope of this disclosure, will be readily apparent to those of skill in the art based upon this disclosure. It is also contemplated that various combinations or sub-combinations of the specific features and aspects of the embodiments may be made and still fall within the scope of the disclosure. It should be understood that various features and aspects of the disclosed embodiments can be combined with, or substituted for, one another in order to form varying modes of the embodiments of the disclosed systems and processes. Any methods disclosed herein need not be performed in the order recited. Thus, it is intended that the scope of the systems and processes herein disclosed should not be limited by the particular embodiments described above.
It will be appreciated that the systems and methods of the disclosure each have several innovative aspects, no single one of which is solely responsible or required for the desirable attributes disclosed herein. The various features and processes described above may be used independently of one another or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure.
Certain features that are described in this specification in the context of separate embodiments also may be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment also may be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination. No single feature or group of features is necessary or indispensable to each and every embodiment.
It will also be appreciated that conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “for example,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. In addition, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list. In addition, the articles “a,” “an,” and “the” as used in this application and the appended claims are to be construed to mean “one or more” or “at least one” unless specified otherwise. Similarly, while operations may be depicted in the drawings in a particular order, it is to be recognized that such operations need not be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Further, the drawings may schematically depict one or more example processes in the form of a flowchart. However, other operations that are not depicted may be incorporated in the example methods and processes that are schematically illustrated. 
For example, one or more additional operations may be performed before, after, simultaneously, or between any of the illustrated operations. Additionally, the operations may be rearranged or reordered in other embodiments. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products. Additionally, other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve desirable results.
Further, while the methods and devices described herein may be susceptible to various modifications and alternative forms, specific examples thereof have been shown in the drawings and are herein described in detail. It should be understood, however, that the embodiments are not to be limited to the particular forms or methods disclosed, but, to the contrary, the embodiments are to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the various implementations described and the appended claims. Further, the disclosure herein of any particular feature, aspect, method, property, characteristic, quality, attribute, element, or the like in connection with an implementation or embodiment can be used in all other implementations or embodiments set forth herein. Any methods disclosed herein need not be performed in the order recited. The methods disclosed herein may include certain actions taken by a practitioner; however, the methods can also include any third-party instruction of those actions, either expressly or by implication. The ranges disclosed herein also encompass any and all overlap, sub-ranges, and combinations thereof. Language such as “up to,” “at least,” “greater than,” “less than,” “between,” and the like includes the number recited. Numbers preceded by a term such as “about” or “approximately” include the recited numbers and should be interpreted based on the circumstances (for example, as accurate as reasonably possible under the circumstances, for example ±5%, ±10%, ±15%, etc.). For example, “about 3.5 mm” includes “3.5 mm.” Phrases preceded by a term such as “substantially” include the recited phrase and should be interpreted based on the circumstances (for example, as much as reasonably possible under the circumstances). For example, “substantially constant” includes “constant.” Unless stated otherwise, all measurements are at standard conditions including temperature and pressure.
As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: A, B, or C” is intended to cover: A, B, C, A and B, A and C, B and C, and A, B, and C. Conjunctive language such as the phrase “at least one of X, Y and Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to convey that an item, term, etc. may be at least one of X, Y or Z. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of X, at least one of Y, and at least one of Z to each be present. The headings provided herein, if any, are for convenience only and do not necessarily affect the scope or meaning of the devices and methods disclosed herein.
Accordingly, the claims are not intended to be limited to the embodiments shown herein but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.
This application claims the benefit of U.S. Provisional Application No. 63/235,018, filed Aug. 19, 2021, titled “COMMUNICATION LATENCY MITIGATION FOR ON-CHIP NETWORKS,” the disclosure of which is hereby incorporated by reference in its entirety and for all purposes.
Filing Document: PCT/US2022/040497
Filing Date: 8/16/2022
Country: WO