With the exponential growth of neural-network-based deep learning applications across business units, commodity central processing unit (CPU)/graphics processing unit (GPU) based platforms are no longer a suitable computing substrate to support the ever-growing computation demands in terms of performance, power efficiency, and economic scalability. Developing neural network processors to accelerate neural-network-based deep learning applications has therefore gained significant traction across many business segments, including established integrated circuit (IC) manufacturers, start-up companies, and large Internet companies.
The existing Neural-network Processing Units (NPUs) or Tensor Processing Units (TPUs) feature a programmable deterministic execution pipeline. The key parts of this pipeline may include a matrix unit with 256×256 8-bit Multiplier-Accumulator units (MACs) and a 24 mebibyte (MiB) memory buffer. However, as semiconductor technology progresses toward the 7 nm node, the transistor density is expected to increase more than 10×. In such configurations, enabling efficient data transfer may require increasing the size of the matrix unit and the buffer size, potentially creating more challenges.
The present disclosure relates to a machine learning accelerator system and methods for exchanging data therein. The machine learning accelerator system may include a switch network comprising an array of switch nodes and an array of processing elements. Each processing element of the array of processing elements may be connected to a switch node of the array of switch nodes and is configured to generate data that is transportable via the switch node. The generated data may be transported in one or more data packets, the one or more data packets comprising information related with a location of the destination processing element, a storage location within the destination processing element, and the generated data.
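The packet layout described above can be sketched as follows. This is a minimal illustration, not the literal wire format; the field names and coordinate-pair addressing are assumptions.

```python
from dataclasses import dataclass

@dataclass
class DataPacket:
    dest_x: int           # column of the destination processing element (assumed coordinate form)
    dest_y: int           # row of the destination processing element
    storage_offset: int   # storage location within the destination element's memory
    payload: bytes        # the generated data

# A packet addressed to element (2, 3), to be stored at offset 0x40.
packet = DataPacket(dest_x=2, dest_y=3, storage_offset=0x40, payload=b"\x01\x02")
```

Bundling the destination coordinates and the storage location with the data lets each switch node make a forwarding decision locally, without global routing tables.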
The present disclosure provides a method of transporting data in a machine learning accelerator system. The method may comprise receiving input data from a data source, using a switch node of an array of switch nodes of a switch network. The method may include generating output data based on the input data, using a processing element that is connected to the switch node and is part of an array of processing elements; and transporting the generated output data to a destination processing element of the array of processing elements via the switch network using the switch node.
Consistent with some disclosed embodiments, a computer-readable storage medium comprising a set of instructions executable by at least one processor to perform the aforementioned method is provided.
Consistent with other disclosed embodiments, a non-transitory computer-readable storage medium may store program instructions, which are executed by at least one processing device to perform the aforementioned method described herein.
Embodiments and various aspects of the present disclosure are illustrated in the following detailed description and the accompanying figures. Various features shown in the figures are not drawn to scale.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the invention as recited in the appended claims.
As indicated above, conventional accelerators have several flaws. For example, conventional Graphics Processing Units (GPUs) may feature thousands of shader cores with a full instruction set, a dynamic work scheduler, and a complicated memory hierarchy, resulting in high power consumption and unnecessary overhead for deep learning workloads.
Conventional Data Processing Units (DPUs) may feature a data-flow-based coarse-grained reconfigurable architecture (CGRA). This CGRA may be configured as a mesh of 32×32 clusters, and each cluster may be configured as 16 dataflow processing elements (PEs). Data may be passed through this mesh by PEs passing data directly to their neighbors. This may require PEs to spend several cycles passing data instead of focusing on computing, rendering the dataflow inefficient.
The embodiments of the present invention overcome these issues of conventional accelerators. For example, the embodiments provide a lightweight switch network, thereby allowing the PEs to focus on computing. Moreover, the computing and storage resources are distributed across many PEs. With the help of 2D mesh connections, data may be communicated among the PEs. Software can flexibly divide the workloads and data of a neural network across the arrays of PEs and program the data flows accordingly. For similar reasons, it is easy to add additional resources without increasing the difficulty of packing more work and data.
On-chip communication system 102 may include a global manager 122 and a plurality of processing elements 124. Global manager 122 may include one or more task managers 126 configured to coordinate with one or more processing elements 124. Each task manager 126 may be associated with an array of processing elements 124 that provide synapse/neuron circuitry for the neural network. For example, the top layer of processing elements of
Processing elements 124 may include one or more processing elements that each employ a single instruction, multiple data (SIMD) architecture including one or more processing units configured to perform one or more operations (e.g., multiplication, addition, multiply-accumulate, etc.) on the communicated data under the control of global manager 122. To perform the operation on the communicated data packets, processing elements 124 may include a core and a memory buffer. Each processing element may comprise any number of processing units. In some embodiments, processing element 124 may be considered a tile or the like.
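The SIMD-style multiply-accumulate operation mentioned above can be sketched as a single operation applied across a whole vector of lanes. This is an illustrative scalar model, not the hardware datapath:

```python
def simd_mac(acc, a, b):
    """Element-wise multiply-accumulate: one instruction, applied to every
    lane of the input vectors, as in the SIMD architecture described above."""
    return [c + x * y for c, x, y in zip(acc, a, b)]

# Accumulating the products of two 3-lane vectors into a zeroed accumulator:
# simd_mac([0, 0, 0], [1, 2, 3], [4, 5, 6]) -> [4, 10, 18]
```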
Host memory 104 may be off-chip memory such as a host CPU's memory. For example, host memory 104 may be a double data rate synchronous dynamic random-access memory (DDR-SDRAM), or the like. Host memory 104 may be configured to store a large amount of data with slower access speed, compared to the on-chip memory integrated within one or more processors, and may act as a higher-level cache.
Memory controller 106 may manage the reading and writing of data to and from a memory block (e.g., HBM2) within global memory 116. For example, memory controller 106 may manage read/write data coming from an external chip communication system (e.g., from DMA unit 108 or a DMA unit corresponding with another NPU) or from on-chip communication system 102 (e.g., from a local memory in processing element 124 via a 2D mesh controlled by a task manager 126 of global manager 122). Moreover, while one memory controller is shown in
Memory controller 106 may generate memory addresses and initiate memory read or write cycles. Memory controller 106 may contain several hardware registers that may be written and read by the one or more processors. The registers may include a memory address register, a byte-count register, one or more control registers, and other types of registers. These registers may specify some combination of the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device), the size of the transfer unit, the number of bytes to transfer in one burst, and/or other typical features of memory controllers.
DMA unit 108 may assist with transferring data between host memory 104 and global memory 116. In addition, DMA unit 108 may assist with transferring data between multiple accelerators. DMA unit 108 may allow off-chip devices to access both on-chip and off-chip memory without causing a CPU interrupt. Thus, DMA unit 108 may also generate memory addresses and initiate memory read or write cycles. DMA unit 108 also may contain several hardware registers that may be written and read by the one or more processors, including a memory address register, a byte-count register, one or more control registers, and other types of registers. These registers may specify some combination of the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device), the size of the transfer unit, and/or the number of bytes to transfer in one burst. It is appreciated that accelerator architecture 100 may include a second DMA unit, which may be used to transfer data between other accelerator architectures to allow multiple accelerator architectures to communicate directly without involving the host CPU.
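The register set described above (source, destination, direction, transfer size, burst length) can be modeled as a transfer descriptor. This is a minimal sketch under assumed names; the disclosure does not specify a register layout:

```python
from dataclasses import dataclass

@dataclass
class DmaDescriptor:
    src_addr: int      # memory address register: source of the transfer
    dst_addr: int      # memory address register: destination of the transfer
    byte_count: int    # byte-count register: total bytes to move
    burst_bytes: int   # number of bytes to transfer in one burst
    to_device: bool    # direction: True = write to the I/O device, False = read from it

def num_bursts(d: DmaDescriptor) -> int:
    """Bursts needed to move byte_count bytes, burst_bytes at a time
    (ceiling division)."""
    return -(-d.byte_count // d.burst_bytes)

d = DmaDescriptor(src_addr=0x1000, dst_addr=0x8000,
                  byte_count=4096, burst_bytes=256, to_device=True)
# num_bursts(d) -> 16
```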
JTAG/TAP controller 110 may specify a dedicated debug port implementing a serial communications interface (e.g., a JTAG interface) for low-overhead access to the accelerator without requiring direct external access to the system address and data buses. The JTAG/TAP controller 110 may also have an on-chip test access interface (e.g., a TAP interface) that is configured to implement a protocol to access a set of test registers that present chip logic levels and device capabilities of various parts.
Peripheral interface 112 (such as a PCIe interface), if present, may serve as an (and typically the) inter-chip bus, providing communication between the accelerator and other devices.
Bus 114 includes both intra-chip bus and inter-chip buses. The intra-chip bus connects all internal components to one another as called for by the system architecture. While not all components are connected to every other component, all components do have some connection to other components they need to communicate with. The inter-chip bus connects the accelerator with other devices, such as the off-chip memory or peripherals. Typically, if there is a peripheral interface 112 (e.g., the inter-chip bus), bus 114 is solely concerned with intra-chip buses, though in some implementations it could still be concerned with specialized inter-bus communications.
While accelerator architecture 100 of
Reference is now made to
As illustrated in
In some embodiments, NPU 202 may comprise a compiler (not shown). The compiler may be a program or a computer software that transforms computer code written in one programming language into NPU instructions to create an executable program. In machine learning applications, a compiler may perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, code optimization, code generation, or combinations thereof.
In some embodiments, the compiler may be on a host unit (e.g., host CPU 208 or host memory 210 of
It is appreciated that the first few instructions received by the processing element may instruct the processing element to load/store data from the global memory into one or more local memories of the processing element (e.g., a memory of the processing element or a local memory for each active processing element). Each processing element may then initiate the instruction pipeline, which involves fetching the instruction (e.g., via a fetch unit) from the local memory, decoding the instruction (e.g., via an instruction decoder) and generating local memory addresses (e.g., corresponding to an operand), reading the source data, executing or loading/storing operations, and then writing back results.
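The per-element pipeline just described can be sketched as one pass over a local memory. The function and hook names are illustrative assumptions, not the disclosed hardware interfaces:

```python
def run_pipeline(local_memory, pc, decode, execute):
    """One pass of the pipeline described above: fetch, decode and generate
    local-memory addresses, read the source operand, execute or load/store,
    then write back the result."""
    instruction = local_memory[pc]                    # fetch (via the fetch unit)
    opcode, src_addr, dst_addr = decode(instruction)  # decode; derive operand addresses
    operand = local_memory[src_addr]                  # read source data
    result = execute(opcode, operand)                 # execute or load/store
    local_memory[dst_addr] = result                   # write back
    return result

# Tiny demo: a single "double" instruction stored at address 0.
mem = {0: ("double", 10, 11), 10: 21, 11: None}
run_pipeline(mem, 0,
             decode=lambda ins: ins,
             execute=lambda op, x: 2 * x if op == "double" else x)
# mem[11] is now 42
```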
Host CPU 208 may be associated with host memory 210 and disk 212. In some embodiments, host memory 210 may be an integral memory or an external memory associated with host CPU 208. Host memory 210 may be a local or a global memory. In some embodiments, disk 212 may comprise an external memory configured to provide additional memory for host CPU 208.
Reference is now made to
In some embodiments, switch network 302 may include an array of switch nodes 304. Switch nodes 304 may be arranged in a manner to form a two-dimensional (2D) array of switch nodes 304. In some embodiments, as illustrated in
As illustrated in
In some embodiments, switch node 304 may be configured to respond to processing element 306 based on an operating status of switch node 304. For example, if switch node 304 is busy routing data packets, switch node 304 may reject or temporarily push back data packets from processing element 306. In some embodiments, switch node 304 may re-route data packets, for example, switch node 304 may change the flow direction of data packets from a horizontal path to a vertical path, or from a vertical path to a horizontal path, based on the operating status or the overall system status.
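One illustrative policy consistent with the behavior above follows: a busy node re-routes a packet between its horizontal and vertical paths, or pushes it back entirely. The `node_busy_on` status-query hook is an assumption, not part of the disclosure:

```python
def respond(node_busy_on, preferred_path, alternate_path):
    """Decide how a switch node handles a packet based on its operating
    status: forward on the preferred path, re-route to the alternate path
    (e.g., horizontal -> vertical), or temporarily push the packet back."""
    if not node_busy_on(preferred_path):
        return ("forward", preferred_path)
    if not node_busy_on(alternate_path):
        return ("reroute", alternate_path)   # change flow direction
    return ("pushback", None)                # reject / temporarily push back

# respond(lambda p: p == "horizontal", "horizontal", "vertical")
# -> ("reroute", "vertical")
```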
In some embodiments, switch network 302 may comprise a 2D array of switch nodes 304, each switch node connecting to a corresponding individual processing element 306. Switch nodes 304 may be configured to transfer data from one location to another, while processing element 306 may be configured to compute the input data to generate output data. Such a distribution of computing and transferring resources may allow switch network 302 to be light-weight and efficient. A light-weight 2D switch network may have some or all of the advantages discussed herein, among others.
In some embodiments, DMA unit 308 may be similar to DMA unit 108 of
Deep learning accelerator system 300 may comprise host CPU 310. In some embodiments, host CPU 310 may be electrically connected with control unit 314. Host CPU 310 may also be connected to peripheral interface 312 and high bandwidth interface 318. DMA unit 308 may communicate with host CPU 310 or high bandwidth memory 316 through high bandwidth memory interface 318. In some embodiments, high bandwidth memory 316 may be similar to global memory 116 of deep learning accelerator system 100, shown in
Reference is now made to
Reference is now made to
In some embodiments, a horizontal pipelined data transfer, as illustrated in
As an example,
In some embodiments, processing element 306 associated with switch node 304 may be configured to receive data packet (e.g., data packet 400 of
Reference is now made to
In some embodiments, a vertical pipelined data transfer, as illustrated in
Reference is now made to
In some embodiments, the direction of data flow may be determined by a software before being executed or before runtime. For example, the software may determine a horizontal data flow in a pipelined manner when processing elements 306 generate output data including computation results, and the software may determine a vertical data flow in a pipelined manner when processing elements 306 share input data with their neighboring processing elements.
Reference is now made to
In step 810, a switch node (e.g., switch node 304 of
The DMA unit may assist with transferring data between a host memory (e.g., local memory of host CPU) and a high bandwidth memory (e.g., high bandwidth memory 316 of
Switch nodes may be configured to receive input data and transport the received input data or the output data from the processing elements to the destination location within the switch network. The mesh switch network may enable point-to-point data communication between the 2D array of processing elements.
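The consume-or-forward behavior implied by this point-to-point communication can be sketched as follows. The packet field names and the `forward` hook are assumptions for illustration:

```python
def on_packet(my_xy, packet, local_buffer, forward):
    """If the packet is addressed to this processing element, store the
    payload at the indicated storage location in the local buffer;
    otherwise hand the packet back to the switch node for forwarding."""
    if (packet["dest_x"], packet["dest_y"]) == my_xy:
        local_buffer[packet["storage_offset"]] = packet["payload"]
        return "consumed"
    forward(packet)
    return "forwarded"

buf, sent = {}, []
pkt = {"dest_x": 1, "dest_y": 2, "storage_offset": 8, "payload": 99}
on_packet((1, 2), pkt, buf, sent.append)   # consumed here: buf[8] == 99
on_packet((0, 0), pkt, buf, sent.append)   # not for (0, 0): forwarded onward
```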
In step 820, a processing element (e.g., processing element 306 of
The processing element may comprise a processor core (e.g., processor core 320 of
The processing element may comprise a local memory or a global shared memory. The local memory of the processing element may be accessed by processor core 320 of the processing element, whereas the global shared memory may be accessed by any processor core of any processing element in the mesh switch network.
In step 830, the generated output data or data packet may be transported to the destination processing element based on the destination information stored in the memory buffer of the processing element. Data may be transported to the destination processing element through one or more routes. The data transportation route may be based on a pre-defined configuration of the array of switch nodes or the array of processing elements in the mesh switch network. Software, firmware, or a computer-executable program may determine the route prior to runtime.
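As one concrete instance of such a pre-runtime route, the following sketches a dimension-ordered route over the 2D mesh: travel along the row first, then the column. The X-then-Y ordering is an assumption; the disclosure leaves the routing policy to software:

```python
def xy_route(src, dst):
    """Enumerate the switch-node hops of a statically determined route
    from src to dst: the horizontal leg first, then the vertical leg."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                   # horizontal leg along the row
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                   # vertical leg along the column
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

# xy_route((0, 0), (2, 1)) -> [(1, 0), (2, 0), (2, 1)]
```

Because every route is fixed before runtime, the software can analyze all routes together and stagger the traffic to avoid congestion and deadlocks, as described below.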
In some embodiments, the data or the data packet may be transported along a route determined by statically analyzing data flow patterns, data flow traffic, data volume, or the like. The software (e.g., a compiler on a host CPU) may also schedule the tasks for processing elements and program the processing elements to generate data packets that avoid congestion and deadlocks. The determined route may be a horizontal path as shown in
The various example embodiments described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and nonremovable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVD), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
In the foregoing specification, embodiments have been described with reference to numerous specific details that may vary from implementation to implementation. Certain adaptations and modifications of the described embodiments may be made. Other embodiments may be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims. It is also intended that the sequence of steps shown in the figures is only for illustrative purposes and is not intended to be limited to any particular sequence of steps. As such, those skilled in the art may appreciate that these steps may be performed in a different order while implementing the same method.
In the drawings and specification, there have been disclosed exemplary embodiments. However, many variations and modifications may be made to these embodiments. Accordingly, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the embodiments being defined by the following claims.
This application is based upon and claims priority to U.S. Provisional Application No. 62/621,368 filed Jan. 24, 2018 and entitled “Deep Learning Accelerator Method Using a Light Weighted Mesh Network with 2D Processing Unit Array,” which is incorporated herein by reference in its entirety.