This patent application claims priority to China Patent Application No. 202111561477.3 filed Dec. 15, 2021 by Liang HAN et al., which is hereby incorporated by reference in its entirety.
The system 100 incorporates a unified memory addressing space using, for example, the partitioned global address space (PGAS) programming model. Thus, in the example of
The system 100 can be used for applications such as but not limited to graph analytics and graph neural networks, and more specifically for applications such as but not limited to online shopping engines, social networking, recommendation engines, mapping engines, failure analysis, network management, and search engines. Such applications execute a tremendous number of memory access requests (e.g., read and write requests), and as a consequence also transfer (e.g., read and write) a tremendous amount of data for processing. While PCIe bandwidth and data transfer rates are considerable, they are nevertheless limiting for such applications: PCIe is simply too slow, and its bandwidth too narrow, to keep up with the volume of memory access requests and data transfers that such applications generate.
Embodiments according to the present disclosure provide a solution to the problem described above. Embodiments according to the present disclosure provide an improvement in the functioning of computing systems in general and applications such as, but not limited to, neural network and artificial intelligence (AI) workloads. More specifically, embodiments according to the present disclosure introduce methods, systems, and programming models that increase the speed at which applications such as neural network and AI workloads can be performed, by increasing the speeds at which memory access requests (e.g., read requests and write requests) between elements of the system are sent and received and resultant data transfers are completed. The disclosed systems, methods, and programming models allow processing units in the system to communicate without using a traditional network (e.g., Ethernet) that uses a relatively narrow and slow Peripheral Component Interconnect Express (PCIe) bus.
In embodiments, a system includes a high-bandwidth inter-chip network (ICN) that allows communication between neural network processing units (NPUs) in the system. For example, the ICN allows an NPU to communicate with other NPUs on the same compute node or server and also with NPUs on other compute nodes or servers. In embodiments, communication can be at the command level (e.g., at the direct memory access level) and at the instruction level (e.g., at the finer-grained load/store instruction level). The ICN allows NPUs in the system to communicate without using a PCIe bus, thereby avoiding its bandwidth limitations and relative lack of speed.
Data can be transferred between NPUs in a push mode or in a pull mode. When operating in a command-level push mode, a first NPU copies data from memory on the first NPU to memory on a second NPU and then sets a flag on the second NPU, and the second NPU waits until the flag is set to use the data pushed from the first NPU. When operating in a command-level pull mode, a first NPU allocates memory on the first NPU and then sets a flag on a second NPU to indicate that the memory on the first NPU is allocated, and the second NPU waits until the flag is set to read the data from the allocated memory on the first NPU. When operating in an instruction-level push mode, an operand associated with a processing task that is being executed by a first processing unit is stored in a buffer on the first processing unit, and a result of the processing task is written to a buffer on a second processing unit.
These and other objects and advantages of the various embodiments of the invention will be recognized by those of ordinary skill in the art after reading the following detailed description of the embodiments that are illustrated in the various drawing figures.
The accompanying drawings, which are incorporated in and form a part of this specification and in which like numerals depict like elements, illustrate embodiments of the present disclosure and, together with the detailed description, serve to explain the principles of the disclosure.
Reference will now be made in detail to the various embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. While described in conjunction with these embodiments, it will be understood that they are not intended to limit the disclosure to these embodiments. On the contrary, the disclosure is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present disclosure, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present disclosure.
Some portions of the detailed descriptions that follow are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present application, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those utilizing physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computing system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as transactions, bits, values, elements, symbols, characters, samples, pixels, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present disclosure, discussions utilizing terms such as “accessing,” “allocating,” “storing,” “receiving,” “sending,” “writing,” “reading,” “transmitting,” “loading,” “pushing,” “pulling,” “processing,” “caching,” “routing,” “determining,” “selecting,” “requesting,” “synchronizing,” “copying,” “mapping,” “updating,” “translating,” “generating,” or the like, refer to actions and processes of an apparatus or computing system (e.g., the methods of
Some elements or embodiments described herein may be discussed in the general context of computer-executable instructions residing on some form of computer-readable storage medium, such as program modules, executed by one or more computers or other devices. By way of example, and not limitation, computer-readable storage media may comprise non-transitory computer storage media and communication media. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or distributed as desired in various embodiments.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, double data rate (DDR) memory, random access memory (RAM) such as static RAM (SRAM) or dynamic RAM (DRAM), read only memory (ROM), electrically erasable programmable ROM (EEPROM), flash memory (e.g., an SSD) or other memory technology, compact disk ROM (CD-ROM), digital versatile disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed to retrieve that information.
Communication media can embody computer-executable instructions, data structures, and program modules, and includes any information delivery media. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media. Combinations of any of the above can also be included within the scope of computer-readable media.
The system 200 and NPU_0 can include elements or components in addition to those illustrated and described below, and elements or components can be arranged as shown in the figure or in a different way. Some of the blocks in the example system 200 and NPU_0 may be described in terms of the function they perform. Where elements and components of the system are described and illustrated as separate blocks, the present disclosure is not so limited; that is, for example, a combination of blocks/functions can be integrated into a single block that performs multiple functions. The system 200 can be scaled up to include additional NPUs, and is compatible with different scaling schemes including hierarchical scaling schemes and flattened scaling schemes.
In general, the system 200 includes a number of compute nodes or servers, and each compute node or server includes a number of parallel computing units or chips (e.g., NPUs). In the example of
In the embodiments of
In the embodiments of
The server 202 includes elements like those of the server 201. That is, in embodiments, the servers 201 and 202 have identical structures (although ‘m’ may or may not be equal to ‘n’), at least to the extent described herein. Other servers in the system 200 may be similarly structured.
The NPUs on the server 201 can communicate with (are communicatively coupled to) each other over the bus 208. The NPUs on the server 201 can communicate with the NPUs on the server 202 over the network 240 via the buses 208 and 209 and the NICs 206 and 207.
In general, each of the NPUs on the server 201 includes elements such as, but not limited to, a processing core and memory. Specifically, in the embodiments of
NPU_0 may also include other functional blocks or components (not shown) such as a command processor, a direct memory access (DMA) block, and a PCIe block that facilitates communication to the PCIe bus 208. The NPU_0 can include elements and components other than those described herein or shown in
Other NPUs on the servers 201 and 202 include elements and components like those of the NPU_0. That is, in embodiments, the NPUs on the servers 201 and 202 have identical structures, at least to the extent described herein.
The system 200 of
In the example of
The actual connection topology (which NPU is connected to which other NPU) is a design or implementation choice.
Communication between NPUs can be at the command level (e.g., a DMA copy) and at the instruction level (e.g., a direct load or store). The ICN 250 allows servers and NPUs in the system 200 to communicate without using the PCIe bus 208, thereby avoiding its bandwidth limitations and relative lack of speed.
Communication between NPUs includes the transmission of memory access requests (e.g., read requests and write requests) and the transfer of data in response to such requests. Communication between any two NPUs—where the two NPUs may be on the same server or on different servers—can be direct or indirect.
Direct communication is over a single link between the two NPUs, and indirect communication occurs when information from one NPU is relayed to another NPU via one or more intervening NPUs. For example, in the configuration exemplified in
In embodiments, the system 200 incorporates a unified memory addressing space using, for example, the partitioned global address space (PGAS) programming model. Accordingly, memory space in the system 200 can be globally allocated so that the HBMs 216 on the NPU_0, for example, are accessible by the other NPUs on that server and by the NPUs on other servers in the system 200, and the NPU_0 can likewise access the HBMs on other NPUs/servers in the system. Thus, in the example of
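By way of illustration only, the following sketch shows one way a PGAS-style global address might be formed, with the upper bits of a 64-bit address identifying the owning NPU and the remaining bits giving the offset into that NPU's local HBM. The field widths and the helper names (make_global_addr, npu_of, offset_of) are assumptions made for this sketch and are not taken from the embodiments described herein.

#include <cstdint>
#include <cstdio>

// Hypothetical PGAS-style encoding: upper 16 bits select the owning NPU,
// lower 48 bits are the offset into that NPU's local HBM.
constexpr int kNpuShift = 48;
constexpr uint64_t kOffsetMask = (1ULL << kNpuShift) - 1;

uint64_t make_global_addr(uint16_t npu_id, uint64_t local_offset) {
    return (static_cast<uint64_t>(npu_id) << kNpuShift) | (local_offset & kOffsetMask);
}

uint16_t npu_of(uint64_t global_addr)   { return static_cast<uint16_t>(global_addr >> kNpuShift); }
uint64_t offset_of(uint64_t global_addr) { return global_addr & kOffsetMask; }

int main() {
    // Offset 0x1000 in the HBM of NPU 3, as seen from any NPU in the system.
    uint64_t g = make_global_addr(3, 0x1000);
    printf("npu=%u offset=0x%llx\n", (unsigned)npu_of(g), (unsigned long long)offset_of(g));
    return 0;
}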
The server 201 is coupled to the ICN 250 by the ICN subsystem 230 (
In the configuration of
The NPU 300 of
The ICN subsystem 230 includes ICN communication command rings (e.g., the communication command ring 312; collectively, the communication command rings 312) coupled to the compute command rings 302. The communication command rings 312 may be implemented as a number of buffers. There may be a one-to-one correspondence between the communication command rings 312 and the compute command rings 302. In an embodiment, there are 16 compute command rings 302 and 16 communication command rings 312.
In the embodiments of
More specifically, when a compute command is decomposed and dispatched to one (or more) of the cores 212, a kernel (e.g., a program, or a sequence of processor instructions) will start running in that core or cores. When there is a memory access instruction, the instruction is issued to memory: if the memory address is determined to be a local memory address, then the instruction goes to a local HBM 216 via the NoC 210; otherwise, if the memory address is determined to be a remote memory address, then the instruction goes to the instruction dispatch block 306.
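As a minimal sketch of the local/remote decision described above, the following illustrative routine checks whether the NPU identified by the address matches the local NPU and, on that basis, sends the access either toward the local HBM 216 via the NoC 210 or toward the instruction dispatch block 306. The address encoding is the assumption carried over from the previous sketch, and the function names are placeholders rather than actual hardware interfaces.

#include <cstdint>
#include <cstdio>

struct MemoryAccess { uint64_t addr; bool is_write; };

constexpr int kNpuShift = 48;  // same assumed PGAS encoding as the earlier sketch

void issue_to_noc(const MemoryAccess&)      { puts("-> NoC 210 -> local HBM 216"); }
void issue_to_dispatch(const MemoryAccess&) { puts("-> instruction dispatch block 306 -> ICN"); }

void issue_memory_access(const MemoryAccess& req, uint16_t local_npu_id) {
    uint16_t owner = static_cast<uint16_t>(req.addr >> kNpuShift);
    if (owner == local_npu_id) {
        issue_to_noc(req);        // local memory address: stays on-chip
    } else {
        issue_to_dispatch(req);   // remote memory address: forwarded toward the ICN
    }
}

int main() {
    issue_memory_access({ (3ULL << kNpuShift) | 0x1000, true },  /*local_npu_id=*/0);  // remote
    issue_memory_access({ (0ULL << kNpuShift) | 0x2000, false }, /*local_npu_id=*/0);  // local
    return 0;
}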
The ICN subsystem 230 also includes a number of chip-to-chip (C2C) DMA units (e.g., the DMA unit 308; collectively, the DMA units 308) that are coupled to the command and instruction dispatch blocks 304 and 306. The DMA units 308 are also coupled to the NoC 210 via C2C fabric 309 and a network interface unit (NIU) 310, and are also coupled to the switch 234, which in turn is coupled to the ICLs 236 that are coupled to the ICN 250.
In an embodiment, there are 16 communication command rings 312 and seven DMA units 308. There may be a one-to-one correspondence between the DMA units 308 and the ICLs 236. The command dispatch block 304 maps the communication command rings 312 to the DMA units 308 and hence to the ICLs 236. The command dispatch block 304, the instruction dispatch block 306, and the DMA units 308 may each include a buffer such as a first-in first-out (FIFO) buffer (not shown).
The ICN communication control block 232 maps an outgoing memory access request to an ICL 236 that is selected based on the address in the request. The ICN communication control block 232 forwards the memory access request to the DMA unit 308 that corresponds to the selected ICL 236. From the DMA unit 308, the request is then routed by the switch 234 to the selected ICL.
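The following sketch illustrates, under the same assumed address encoding, how a routing table keyed by destination NPU might be used to pick an ICL 236 and its corresponding C2C DMA unit 308 for an outgoing request. The Route structure and the table contents are purely illustrative.

#include <cstdint>
#include <cstdio>
#include <map>

constexpr int kNpuShift = 48;  // assumed PGAS encoding from the earlier sketches

struct Route { int icl; int dma_unit; };

// Map the destination NPU encoded in the address to an ICL and its DMA unit.
Route select_icl(uint64_t remote_addr, const std::map<uint16_t, Route>& routing_table) {
    uint16_t dest_npu = static_cast<uint16_t>(remote_addr >> kNpuShift);
    return routing_table.at(dest_npu);  // which link/DMA unit reaches that NPU
}

int main() {
    // Example: NPU 1 is reached over ICL 0, NPU 2 over ICL 3 (arbitrary values).
    std::map<uint16_t, Route> table = { {1, {0, 0}}, {2, {3, 3}} };
    Route r = select_icl((2ULL << kNpuShift) | 0x40, table);
    printf("route via ICL %d, DMA unit %d\n", r.icl, r.dma_unit);
    return 0;
}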
An incoming memory access request is received by the NPU 300 at an ICL 236, forwarded to the DMA unit 308 corresponding to that ICL, and then forwarded through the C2C fabric 309 to the NoC 210 via the NIU 310. For a write request, the data is written to a location in an HBM 216 corresponding to the address in the memory access request. For a read request, the data is read from a location in an HBM 216 corresponding to the address in the memory access request.
In embodiments, synchronization of the compute command rings 302 and communication command rings 312 is achieved using FENCE and WAIT commands. For example, a processing core 212 of the NPU 300 may issue a read request for data for a processing task, where the read request addresses an NPU other than the NPU 300. A WAIT command in the compute command ring 302 prevents the core 212 from completing the task until the requested data is received. The read request is pushed into a compute command ring 302, then to a communication command ring 312. An ICL 236 is selected based on the address in the read request, and the command dispatch block 304 or the instruction dispatch block 306 maps the read request to the DMA unit 308 corresponding to the selected ICL 236. Then, when the requested data is fetched from the other NPU and loaded into memory (e.g., an HBM 216) of the NPU 300, the communication command ring 312 issues a sync command (FENCE), which notifies the core 212 that the requested data is available for processing. More specifically, the FENCE command sets a flag in the WAIT command in the compute command ring 302, allowing the core 212 to continue processing the task. Additional discussion is provided below in conjunction with
Continuing with the discussion of
Table 1 provides an example of programming at the command level in the push mode, where the NPU 401, referred to as NPU0 and the producer, is pushing data to the NPU 402, referred to as NPU1 and the consumer.
In the example of Table 1, NPU0 has completed a processing task, and is to push the resultant data to NPU1. Accordingly, NPU0 copies (writes) data from a local buffer (buff1) to address a1 (an array of memory at a1) on NPU1, and also copies (writes) data from another local buffer (buff2) to address a2 (an array of memory at a2) on NPU1. Once both write requests in the communication command ring 312 are completed, NPU0 uses the ICN_FENCE command to set a flag (e1) on NPU1. On NPU1, the WAIT command in the compute command ring 302 is used to instruct NPU1 to wait until the flag is set. When the flag is set in the WAIT command, NPU1 knows that both write operations are completed and that the data can be used.
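The push-mode sequence of Table 1 can be illustrated with a minimal host-side simulation in which NPU1's memory is modeled as ordinary arrays and the e1 flag as an atomic variable. The C++ shown here is a sketch of the ordering semantics only, not the actual command-ring interface; the buffer names mirror the description above.

#include <atomic>
#include <cstdio>
#include <cstring>
#include <thread>

constexpr int N = 4;
float a1[N], a2[N];                 // stand-ins for the memory at a1/a2 on NPU1
std::atomic<bool> e1{false};        // flag set by ICN_FENCE, tested by WAIT

void npu0_producer() {
    float buff1[N] = {1, 2, 3, 4};  // local buffers holding NPU0's results
    float buff2[N] = {5, 6, 7, 8};
    std::memcpy(a1, buff1, sizeof(a1));         // COPY buff1 -> a1 (remote write)
    std::memcpy(a2, buff2, sizeof(a2));         // COPY buff2 -> a2 (remote write)
    e1.store(true, std::memory_order_release);  // ICN_FENCE: set flag e1 on NPU1
}

void npu1_consumer() {
    while (!e1.load(std::memory_order_acquire)) {}  // WAIT until e1 is set
    printf("use1: a1[0]=%g  use2: a2[0]=%g\n", a1[0], a2[0]);  // use1, use2
}

int main() {
    std::thread p(npu0_producer), c(npu1_consumer);
    p.join(); c.join();
    return 0;
}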
The example of Table 1 is illustrated in
The command dispatch block 304 also includes a routing table that identifies which ICL 236 is to be used to route the write requests from the NPU 401 to the NPU 402 based on the addresses in the write requests, as previously described herein. Once the write requests are completed (once the data is written to the HBM 216 on the NPU 402), the flag in the WAIT command is set using the FENCE command (rFENCE), as described above. The compute command ring 302 on the NPU 402 includes, in order, the WAIT command (Wait) and the first and second use commands (use1 and use2). When the flag is set in the WAIT command, the use commands in the compute command ring 302 can be executed, and are used to instruct the appropriate processing core on the NPU 402 that the data in the HBM 216 is updated and available.
Table 2 provides an example of programming at the command level in the pull mode, where the NPU 502, referred to as NPU1 and the consumer, is pulling data from the NPU 501, referred to as NPU0 and the producer.
In the example of Table 2, NPU0 allocates local buffers a1 and a2 in the HBM 216. Once both buffers are allocated, NPU0 uses the FENCE command in the compute command ring 302 to set a flag (e1) on NPU1. On NPU1, the WAIT command in the communication command ring 312 is used to instruct NPU1 to wait until the flag is set. When the flag is set in the WAIT command, NPU1 knows that both buffers are allocated and that the read requests can be performed.
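Similarly, the pull-mode sequence of Table 2 can be illustrated with a minimal host-side simulation in which NPU0's buffers are modeled as arrays that are assumed to already hold the produced data when the FENCE is issued. As before, the code is an illustration of the ordering, not the actual command-ring interface.

#include <atomic>
#include <cstdio>
#include <cstring>
#include <thread>

constexpr int N = 4;
float a1[N], a2[N];                  // buffers allocated (and filled) on NPU0
std::atomic<bool> e1{false};         // flag set by FENCE, tested by WAIT on NPU1

void npu0_producer() {
    for (int i = 0; i < N; ++i) { a1[i] = float(i); a2[i] = float(10 + i); }
    e1.store(true, std::memory_order_release);      // FENCE: buffers are ready
}

void npu1_consumer() {
    float buff1[N], buff2[N];                        // local destination buffers on NPU1
    while (!e1.load(std::memory_order_acquire)) {}   // WAIT until e1 is set
    std::memcpy(buff1, a1, sizeof(buff1));           // remote read (pull) of a1
    std::memcpy(buff2, a2, sizeof(buff2));           // remote read (pull) of a2
    printf("pulled a1[1]=%g a2[1]=%g\n", buff1[1], buff2[1]);
}

int main() {
    std::thread p(npu0_producer), c(npu1_consumer);
    p.join(); c.join();
    return 0;
}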
The example of Table 2 is illustrated in
In the following discussion, the term “warp” is used to refer to the basic unit of execution: the smallest unit or segment of a program that can run independently, although there can be data-level parallelism. While that term may be associated with a particular type of processing unit, embodiments according to the present disclosure are not so limited.
Each warp includes a collection of a number of threads (e.g., 32 threads). Each of the threads executes the same instructions, but has its own input and output data.
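For illustration, the following minimal CUDA-style kernel shows the warp/thread model described above: every thread runs the same instructions but operates on its own element of the input and output arrays, and a launch of 64 threads corresponds to two warps of 32 threads. The kernel name and the scaling operation are arbitrary choices made for this sketch.

#include <cstdio>

__global__ void scale(const float* in, float* out, float k, int n) {
    // Every thread executes this same code; threadIdx.x selects its own data,
    // and threads are grouped into warps of 32 (threadIdx.x / 32 is the warp index).
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = k * in[tid];
    }
}

int main() {
    const int n = 64;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);
    scale<<<1, 64>>>(in, out, 2.0f, n);   // 64 threads = two warps of 32
    cudaDeviceSynchronize();
    printf("out[5]=%g\n", out[5]);
    cudaFree(in); cudaFree(out);
    return 0;
}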
With reference to
Operands associated with a processing task (e.g., a warp) that is being executed by the NPU 601 are stored in the ping-pong buffer 610. For example, an operand is read from the pong buffer 610b, that operand is used by the task to produce a result, and the result is written (using a remote load instruction) to the ping buffer 612a on the NPU 602. Next, an operand is read from the ping buffer 610a, that operand is used by the task to produce another result, and that result is written (using a remote load instruction) to the pong buffer 612b on the NPU 602. The writes (remote loads) are performed using the instruction dispatch block 306 and the C2C DMA units 308.
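A simplified device-side sketch of this alternation follows. The pointers remote_ping and remote_pong stand for the buffers 612a and 612b on the NPU 602; under a unified address space they would be remote addresses reached over the ICN, but to keep the sketch self-contained they are modeled here as ordinary device allocations, and the arithmetic performed on each operand is arbitrary.

#include <cstdio>

__global__ void pingpong_push(const float* local_ping, const float* local_pong,
                              float* remote_ping, float* remote_pong,
                              int n, int iteration) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (iteration & 1) {
        // operand read from the ping half; result pushed to the peer's pong half
        remote_pong[tid] = local_ping[tid] * 2.0f;
    } else {
        // operand read from the pong half; result pushed to the peer's ping half
        remote_ping[tid] = local_pong[tid] * 2.0f;
    }
}

int main() {
    const int n = 32;
    float *lp, *lq, *rp, *rq;
    cudaMallocManaged(&lp, n * sizeof(float));
    cudaMallocManaged(&lq, n * sizeof(float));
    cudaMallocManaged(&rp, n * sizeof(float));
    cudaMallocManaged(&rq, n * sizeof(float));
    for (int i = 0; i < n; ++i) { lp[i] = float(i); lq[i] = float(100 + i); }
    pingpong_push<<<1, n>>>(lp, lq, rp, rq, n, /*iteration=*/0);  // even: pong -> remote ping
    pingpong_push<<<1, n>>>(lp, lq, rp, rq, n, /*iteration=*/1);  // odd:  ping -> remote pong
    cudaDeviceSynchronize();
    printf("remote_ping[3]=%g remote_pong[3]=%g\n", rp[3], rq[3]);
    cudaFree(lp); cudaFree(lq); cudaFree(rp); cudaFree(rq);
    return 0;
}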
Tables 3 and 4 provide an example of programming at the instruction level in the push mode, where the NPU 601, referred to as NPU0 and the producer, is pushing data to the NPU 602, referred to as NPU1 and the consumer.
In the example of
To accomplish that, one of the threads (e.g., the first thread in the first warp, warp-0; see Table 3) running on the NPU 601 is selected as a representative of the thread block that includes the warps. The selected thread communicates with a thread on the NPU 602. Specifically, once all of the data is loaded and ready on the NPU 601, the selected thread uses a subroutine (referred to in Table 3 as the threadfence_system subroutine) to determine that, and then to set a flag (marker) on the NPU 602 to indicate to the NPU 602 that all of the writes (remote loads) have been completed.
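The completion handshake can be sketched in CUDA-style code as follows: after all threads have written their results, the representative thread issues a system-wide fence (__threadfence_system, a standard CUDA primitive) and only then sets the marker flag. The buffer and the flag stand for memory on the NPU 602 but are modeled here as managed memory on a single device so that the sketch is self-contained; the kernel name and sizes are arbitrary.

#include <cstdio>

__global__ void push_and_mark(const float* src, float* remote_buf,
                              volatile int* remote_flag, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        remote_buf[tid] = src[tid];       // each thread pushes its own result
    }
    __syncthreads();                      // all threads in the block have issued their writes
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        __threadfence_system();           // make the prior writes visible system-wide first
        *remote_flag = 1;                 // then set the marker for the consumer
    }
}

int main() {
    const int n = 32;
    float *src, *dst;
    int* flag;
    cudaMallocManaged(&src, n * sizeof(float));
    cudaMallocManaged(&dst, n * sizeof(float));
    cudaMallocManaged(&flag, sizeof(int));
    for (int i = 0; i < n; ++i) src[i] = float(i);
    *flag = 0;
    push_and_mark<<<1, n>>>(src, dst, flag, n);
    cudaDeviceSynchronize();
    printf("flag=%d dst[7]=%g\n", *flag, dst[7]);
    cudaFree(src); cudaFree(dst); cudaFree(flag);
    return 0;
}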
The examples of Tables 3 and 4 include only a single warp (warp-0); however, as noted, there can be multiple warps operating in parallel, with each warp executing the same instructions shown in these tables but with different inputs and outputs. Also, while Tables 3 and 4 are examples of a push mode, the present disclosure is not so limited, and instruction-level programming can be performed for a pull mode.
All or some of the operations represented by the blocks in the flowcharts of
In block 702, a first processing unit generates a memory access request (e.g., a read request or a write request) that includes an address of a memory location on a second processing unit.
In block 704, the first processing unit selects an interconnect of the ICN based on the address in the memory access request.
In block 706, the first processing unit routes the memory access request to the selected interconnect and consequently to the second processing unit. When the memory access request is a read request, the first processing unit receives data from the second processing unit over the interconnect.
In block 802, a first processing unit copies (pushes) data from memory on the first processing unit to memory on a second processing unit.
In block 804, the first processing unit sets a flag on the second processing unit. The flag, when set, allows the second processing unit to use the data pushed from the first processing unit.
In block 902, a first processing unit allocates memory on the first processing unit for data to be read (pulled) by a second processing unit.
In block 904, the first processing unit sets a flag on the second processing unit to indicate the memory on the first processing unit is allocated. The flag, when set, allows the second processing unit to read the data from the memory on the first processing unit.
In block 1002, a first processing unit stores, in a buffer on the first processing unit, an operand associated with a processing task that is being executed by the first processing unit.
In block 1004, the first processing unit writes, to a buffer on the second processing unit, a result of the processing task.
In block 1006, the first processing unit selects a thread of the processing task.
In block 1008, the first processing unit sets a flag on the second processing unit using the thread. The flag indicates to the second processing unit that all writes associated with the processing task are completed.
In summary, embodiments according to the present disclosure provide an improvement in the functioning of computing systems in general and applications such as, for example, neural network and AI workloads that execute on such computing systems. More specifically, embodiments according to the present disclosure introduce methods, programming models, and systems that increase the speed at which applications such as neural network and AI workloads can be executed, by increasing the speeds at which memory access requests (e.g., read requests and write requests) between elements of the system are transmitted and resultant data transfers are completed.
While the foregoing disclosure sets forth various embodiments using specific block diagrams, flowcharts, and examples, each block diagram component, flowchart step, operation, and/or component described and/or illustrated herein may be implemented, individually and/or collectively, using a wide range of configurations. In addition, any disclosure of components contained within other components should be considered as examples because many other architectures can be implemented to achieve the same functionality.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in this disclosure is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing this disclosure.
Embodiments according to the invention are thus described. While the present invention has been described in particular embodiments, the invention should not be construed as limited by such embodiments, but rather construed according to the following claims.