The present invention relates to a computing device for accelerating artificial neural networks and other machine learning algorithms.
An artificial intelligence (AI) accelerator is a class of microprocessor or computer system designed to accelerate artificial neural networks and other machine learning algorithms. A typical neural network has many layers with many nodes on each layer. The nodes are connected by arcs and each node has an activation function. For inference, each node must (i) multiply the input data from each arc by the appropriate weight; (ii) add the results of the multiplications; and (iii) apply an activation function. Training a neural network comprises determining the weights for all of the arcs and determining the activation functions for the nodes. Large neural networks can have thousands of nodes and millions of arcs, thereby requiring an enormous number of calculations for training of and inference by the network. Recently, AI processors have been introduced for such large-scale computations. Typically, an AI accelerator has many processing cores and uses low-precision arithmetic. For example, some AI accelerators are implemented using ASICs (application specific integrated circuits) that comprise over 65,000 8-bit integer multipliers.
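The per-node inference computation described above can be sketched as follows. This is a minimal illustration only; the ReLU activation and the example values are assumptions for demonstration, not part of the disclosure.

```python
# Sketch of the per-node inference computation described above:
# (i) multiply each input by its arc weight, (ii) sum the products,
# (iii) apply an activation function. ReLU is used here for illustration.

def relu(x):
    return max(0.0, x)

def node_output(inputs, weights, activation=relu):
    # Weighted sum over all incoming arcs, then the activation function.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

print(node_output([1.0, -2.0, 0.5], [0.4, 0.3, -0.2]))  # weighted sum is negative, so ReLU outputs 0.0
```

In a large network this multiply-accumulate step dominates the computation, which is why the accelerator dedicates a multiplier array and adder tree to it.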
The activation functions mentioned above are intended as illustrations rather than an exhaustive list of all activation functions. In practice, other activation functions, such as the Softmax function, are also used.
An AI (Artificial Intelligence) processor for Neural Network (NN) processing is disclosed. The AI processor comprises a Core Computing Unit (CCU) comprising at least two Core Computing Elements (CCEs), a unified buffer coupled to the first CCE and the second CCE to store data, and control circuitry coupled to the CCU and the unified buffer. A first Core Computing Element (CCE), corresponding to one level of the CCU, comprises a plurality of Processing Elements (PEs), where each PE comprises a multiplier array, an adder tree and an accumulator. A second CCE, corresponding to another level of the CCU, is coupled to the output of the first CCE, where the second CCE comprises a plurality of Scalar Elements (SEs), and each SE is configured to generate the output of one target activation function for an input to that SE. The AI processor is configured to perform the NN processing for a plurality of users. At one time instance, at least one of said at least two CCEs is divided into at least two groups to allow at least two users of the plurality of users to share it concurrently; at least one part of one of said at least two CCEs is allocated to a first user of the plurality of users and at least one part of another of said at least two CCEs is allocated to a second user of the plurality of users. At another time instance after said one time instance, said at least one part of one of said at least two CCEs is allocated to a next user other than the first user of the plurality of users, or said at least one part of another of said at least two CCEs is allocated to a next user other than the second user of the plurality of users. Accordingly, the AI processor according to the present invention provides flexible AI processor virtualization that allows multiple users to share the resource in space division as well as in time division.
The unified buffer can store activation data for a current layer, one or more next layers, one or more previous layers, or a combination thereof. Furthermore, the unified buffer can store the output of one SE for the current layer, and that output can be provided to one PE as the activation data for a next layer. In one embodiment, the unified buffer is implemented based on dual-port memory including a read port and a write port. Furthermore, the read port can be coupled to a read arbiter to arbitrate read requests from the first CCE, the second CCE and one data multiplexer, and the write port can be coupled to a write arbiter to arbitrate write requests from the first CCE and one data multiplexer.
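One way the read-port arbitration described above could behave is as a fixed-priority arbiter among the three read requesters. The priority order and the requester names below are illustrative assumptions; the disclosure does not specify the arbitration policy.

```python
# Illustrative fixed-priority arbiter for a single read port shared by
# three requesters (first CCE, second CCE, data mux). One grant is issued
# per cycle. The priority order here is an assumption, not the disclosed design.

class ReadArbiter:
    def __init__(self, requesters):
        self.requesters = requesters  # requester names in descending priority

    def grant(self, requests):
        # 'requests' maps requester name -> bool (request asserted this cycle).
        for name in self.requesters:
            if requests.get(name):
                return name  # highest-priority active requester wins the port
        return None          # no request asserted this cycle

arb = ReadArbiter(["first_cce", "second_cce", "data_mux"])
print(arb.grant({"second_cce": True, "data_mux": True}))  # first_cce idle, so second_cce wins
```

A real design might instead use round-robin arbitration to avoid starving the lowest-priority requester; either policy fits the arbiter blocks described here.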
In one embodiment, the control circuitry comprises a command sequencer to send commands to the first CCE, the second CCE and the unified buffer to move data around or to control computations for the NN processing. The control circuitry may be coupled to a host CPU (central processing unit) to receive commands for the NN processing.
The AI processor may further comprise one or more data multiplexers coupled to the CCU, the control circuitry and the unified buffer to switch data.
In one embodiment, the first CCE comprises a weight buffer to store weights for the NN processing. The control circuitry can be further configured to fetch activation data from the unified buffer and weight data from the weight buffer to compute vector multiplication of the activation data and the weight data.
In one embodiment, each PE comprises an array of FP16 (floating point 16-bit) multipliers, and each FP16 multiplier can be configured as one FP16 multiplier or as two int8 (integer 8-bit) multipliers.
The target activation function can be selected from an activation function pool. Each SE may comprise a linear function core, a nonlinear function core, a pooling function core, a cross-channel function core, a programmable function core, a training core, or a combination thereof.
The AI processor may further comprise an interconnection interface to access on-chip configuration registers and memories through an external bus. The interconnection interface may correspond to a PCIe (Peripheral Component Interconnect Express)/DMA (Direct Memory Access) block. The PCIe/DMA block may be used to transfer data between a host memory and both on-chip and off-chip memories by using AXI stream interfaces.
In one embodiment, said at least one of said at least two CCEs is divided into two unequal groups for two users of the plurality of users to share concurrently.
The following description is of the best-contemplated mode of carrying out the invention. This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
It will be readily understood that the components of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the systems and methods of the present invention, as represented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The illustrated embodiments of the invention will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of apparatus and methods that are consistent with the invention as claimed herein.
In the description like reference numbers appearing in the drawings and description designate corresponding or like elements among the different views.
Various embodiments of the present invention are directed to virtualization of an AI accelerator. Before describing the virtualization aspects of the present invention, a brief description of an exemplary AI accelerator is provided.
The AI accelerator depicted in
As is known for NN processing, in each layer the activation data are multiplied by respective weights. The weighted sum is then provided to the input of an activation function, and the output signal from the activation function becomes the activation data for the next layer of the NN processing. The CCU comprises two main elements, i.e., the MXU 140 and the SCU 170. The MXU is primarily responsible for calculating the weighted sum of the activation data, while the SCU is primarily responsible for computing the output of a selected activation function. For convenience, both the MXU and the SCU are referred to as Core Computing Elements (CCEs). In other words, the MXU may be referred to as a first CCE and the SCU may be referred to as a second CCE.
The SCU 170 as shown in
A configuration control (cfgctl) 122 can be connected to an interconnection core, such as the PCIe (Peripheral Component Interconnect Express)/DMA (Direct Memory Access) block 120 shown in
A command sequencer (cmdseq) 124 preferably receives commands from a host CPU (not shown) and sends commands to various blocks to either move data around or start computation for a neural network. The cmdseq 124 can also control the config bus 123 to access on-chip configuration registers and memories in various blocks. Furthermore, the cmdseq 124 can also control the Command Bus 125. The cmdseq 124 may also comprise an arithmetic unit to compute various addresses, offsets or controls to assist virtualization of the address space and multi-user data allocation. The cfgctl 122 may also comprise arbitration logic to coordinate the access requests from both the PCIe/DMA 120 and the cmdseq 124. The cmdseq 124 can also program the PCIe/DMA block 120 via an AXI-Lite slave interface 127. The PCIe/DMA block 120 may also be used to transfer data between the host memory and both on-chip and off-chip memories by using AXI stream interfaces 129. The AI accelerator may comprise, for example, approximately 16 MB of on-chip memory and several GB of off-chip memory. A data mover (DMV) block 113 is used to control data transfer between on-chip and off-chip memories via a 2 kb-wide ring bus 117, which comprises three Ring Nodes (115, 115 and 116) as an example.
The configuration control block preferably comprises three interfaces: an AXI-Lite master interface, a cmdseq-config interface 118, and an internal config bus interface 119. The AXI-Lite interface 127 is for the host CPU to configure the chip, while the cmdseq interface 118 is for the cmdseq 124 to configure the chip. The internal config bus 119 may consist of the following signals: a 48-bit address/data bus; a read/write signal; a request valid signal; a write acknowledge signal; a read data valid signal; and a 32-bit read data bus.
A data mux block 112 can switch data from (1) the PCIe/DMA block 120, (2) config bus 123, and (3) DMV 113 to both on-chip and off-chip memories. Three data mux blocks (112, 131 and 132) may be used: one (i.e., data mux 112) for off-chip memory (i.e., DDR or HBM 111) via a memory controller block or memory management block (MemMan 110), one (i.e., data mux 131) for AB 133, and another one (i.e., data mux 132) for WB 150 (in
The main purpose of the command sequencer 124 (in
The SCU 170 (in
The cmdseq also controls data movement between on-chip and off-chip memories. It preferably can program the PCIe/DMA controller to control data movement between on-chip/off-chip memories and the host memory. In various embodiments, essentially all address generation or determination is done by the cmdseq, with the rest of the chip using the generated addresses or their increments to fetch data for computing or data transfer. Accordingly, within the data movement control block 230, a descriptor generation block 231 is used to generate addresses or their increments for fetching data for computing or data transfer.
The memory management block may have interfaces to the data mux and off-chip memory controller blocks. The memory management block may have no configurable features, although the off-chip memory controller block usually has some. The memory management block is mainly used as a bridge between the data mux interfaces and the memory controller interface. Accordingly, the command sequencer 200 in
Preferably, the memory management is not responsible for performance tuning because the memory controller block usually comes with performance tuning features. The host CPU preferably configures the memory controller block properly to maximize off-chip memory access efficiency. Since the weight buffer and the unified buffer are preferably sized to support most known neural networks, off-chip memory performance tuning may not be critical.
The total size of the unified buffer can be 64 MB, for example. Each type of data can be stored in the allocated partition. If the allocated partition is not big enough to hold all the required data for operations, the AB can work as an on-chip cache, with all data stored in off-chip memory.
For data that will be accessed in a sequential manner, the cache can be implemented as a FIFO. Each cache FIFO may consist of multiple cache lines with configurable line sizes (or FIFO depth). A FIFO may be pushed with a configurable number of lines whenever its occupancy has fallen below a configurable threshold. To maximize off-chip memory bandwidth, the configurable number of lines should be large enough.
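The refill policy described above can be sketched as follows. The class name, depth, threshold and burst values are illustrative assumptions; the disclosure only requires that they be configurable.

```python
# Sketch of the FIFO-style cache refill policy described above: whenever
# occupancy falls below a configurable threshold, a configurable burst of
# cache lines is fetched sequentially from off-chip memory.

from collections import deque

class CacheFifo:
    def __init__(self, depth, refill_threshold, refill_burst):
        self.lines = deque()
        self.depth = depth                      # total FIFO depth in lines
        self.refill_threshold = refill_threshold
        self.refill_burst = refill_burst        # lines pushed per refill
        self.next_addr = 0                      # next sequential off-chip address

    def _fetch_line(self, addr):
        return f"line@{addr}"                   # stand-in for an off-chip read

    def maybe_refill(self):
        # Push a burst of lines once occupancy drops below the threshold,
        # without overfilling past the FIFO depth.
        if len(self.lines) < self.refill_threshold:
            for _ in range(min(self.refill_burst, self.depth - len(self.lines))):
                self.lines.append(self._fetch_line(self.next_addr))
                self.next_addr += 1

    def pop_line(self):
        line = self.lines.popleft()
        self.maybe_refill()
        return line

fifo = CacheFifo(depth=8, refill_threshold=2, refill_burst=4)
fifo.maybe_refill()                             # initial fill: 4 lines
print(fifo.pop_line())                          # -> line@0
```

Using a large refill burst amortizes off-chip request overhead across many lines, which is the bandwidth argument made in the text.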
The weight buffer can be implemented, for example, with 256 separate memories each for one of the 256 PEs. The weight buffer preferably is sized to be big enough to hold the weights of the entire neural network. When the weight buffer is not big enough, additional weights can be loaded from off-chip memory or from the Unified Buffer by using DMV configured by the cmdseq.
The activation feeder 145 can be responsible for initiating the loading of weights from off-chip memory because it is responsible for traversing the weight matrix. This particular function may also be moved to the cmdseq 124.
The core of each processing element (PE) comprises, for example, 256 int8 multipliers (e.g. activation/weight multiplier array 164 in
There can be two multiplication modes: integer 8-bit (int8) and floating point 16-bit (FP16). In the FP16 mode, two int8 multipliers plus some additional logic can be ganged together for FP16 multiplication.
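The disclosure does not spell out the ganging logic. As a hedged illustration of the underlying principle, the sketch below shows how a wide multiply can be decomposed into narrow 8-bit partial products; note that a full 16-bit integer multiply needs four 8-bit products, whereas the hardware's FP16 path can get by with fewer because an FP16 significand is only 11 bits wide (10 stored bits plus a hidden bit).

```python
# Illustration of building a wider multiply from narrow multipliers.
# A 16-bit x 16-bit product is formed from 8-bit partial products; the
# accelerator's actual FP16 ganging logic is not specified in the disclosure.

def mul8(a, b):
    # Stand-in for one 8-bit hardware multiplier (operands must fit in 8 bits).
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b

def mul16_from_mul8(a, b):
    ah, al = a >> 8, a & 0xFF
    bh, bl = b >> 8, b & 0xFF
    # Four partial products shifted into position:
    # a*b = (ah*bh << 16) + ((ah*bl + al*bh) << 8) + al*bl
    return (mul8(ah, bh) << 16) + ((mul8(ah, bl) + mul8(al, bh)) << 8) + mul8(al, bl)

assert mul16_from_mul8(1234, 5678) == 1234 * 5678
```

The "additional logic" mentioned in the text would correspond to the shifters and adders that combine the partial products, plus the exponent and sign handling an FP16 multiply requires.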
Each PE can use a number (e.g., 41) of accumulator buffers (e.g. accumulator 166 in
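The partial-sum accumulation described above can be sketched as follows. The pass width and vector sizes are illustrative assumptions; the point is that a long dot product is built up across passes, with the full sum available only after the last partial sum.

```python
# Sketch of partial-sum accumulation in a PE: a long dot product is split
# into passes, each pass's partial sum is added into an accumulator entry
# (read-modify-write), and the full sum is ready after the last pass.

def accumulate_dot(activations, weights, pass_width, accumulator=0):
    # Process the vectors in chunks ("passes") of pass_width elements.
    for start in range(0, len(activations), pass_width):
        partial = sum(a * w for a, w in
                      zip(activations[start:start + pass_width],
                          weights[start:start + pass_width]))
        accumulator += partial        # read-modify-write of the accumulator entry
    return accumulator                # full sum, available after the last pass

acts = [1, 2, 3, 4, 5, 6]
wts  = [6, 5, 4, 3, 2, 1]
print(accumulate_dot(acts, wts, pass_width=2))  # -> 56
```

Having multiple accumulator buffers per PE lets several such in-flight sums (e.g., for different output rows) proceed without waiting for each full sum to be drained.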
The accumulator buffer memory preferably is implemented with simple dual-port memory with one port for read and another port for write. When the full sum is available (i.e., the last partial sum of a weight row is available), the activation feeder can send an indication to the SCU so that the SCU can pull it. When a full sum is pulled out of an accumulator memory, the PE preferably stalls because its read port is then being used to read the full sum out.
AI accelerators, such as the example described above, may be employed in one or more servers comprising a data center. A server at a data center may comprise one or more CPUs and one or more such AI accelerators. The CPU(s) is/are in communication with the AI accelerator(s) via a high-speed data bus, such as a PCIe bus 121 in
Now that general aspects of an AI accelerator have been described, attention turns to the virtualization aspects. A data center where such AI accelerators are employed may process AI-related tasks and computations for numerous concurrent users. As generally described below, in one embodiment, different users can share different components of an AI accelerator at the same time (e.g., “space division”); in another embodiment, different users can use the same components of the AI accelerator but at different times (e.g., “time division”); and in yet another embodiment, different users can share different components at different times (e.g., “space and time division”).
In one embodiment of the virtualization, each concurrent user is allocated a separate, virtualized hardware memory space. For example, each concurrent user may be allocated separate hardware memory space in all on-chip and off-chip memories of the AI accelerator. That way, data from the concurrent users are not commingled, i.e., are isolated from each other.
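The per-user memory isolation described above can be sketched as a base/limit address translation: each user's accesses are offsets within a private window, translated and bounds-checked so that user data never commingle. The class name, window sizes and user names below are illustrative assumptions.

```python
# Sketch of per-user memory-space virtualization: each user is given a
# private base/limit window in the accelerator's memory, and every access
# is translated and bounds-checked. Window sizes here are illustrative.

class UserAddressMap:
    def __init__(self):
        self.windows = {}   # user -> (physical base, window size)

    def allocate(self, user, base, size):
        self.windows[user] = (base, size)

    def translate(self, user, offset):
        base, size = self.windows[user]
        if not 0 <= offset < size:
            # An out-of-window access would touch another user's data.
            raise ValueError("access outside this user's allocation")
        return base + offset

amap = UserAddressMap()
amap.allocate("user_a", base=0x000000, size=0x100000)
amap.allocate("user_b", base=0x100000, size=0x100000)
print(hex(amap.translate("user_b", 0x10)))   # -> 0x100010
```

This is the kind of translation the cmdseq's address arithmetic could perform when it generates addresses on behalf of each user.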
As shown in
For example, as shown in
In the time division virtualization, the components of the AI accelerator perform operations for one user (e.g., User A) for a number of clock cycles, then the components perform operations for another user for another number of clock cycles, and so on for each of the concurrent users. At the end of the operations for one user, all of the data for that user may be stored in the dedicated memory for that user (on-chip or off-chip memory). The data that are stored for each user may comprise the MXU, SCU and UB states for that user. When it is that user's turn again, the data are read out of the dedicated memory for that user and into the various components (e.g., the MXU, SCU and UB) to continue the operations for that user. In that connection, in various embodiments, the address mapping block may comprise a context pointer for each user, which points to where the MXU, SCU and UB state data for the respective user are stored. The storing and loading of the data to and from the AI accelerator memory (whether on-chip or off-chip) can be performed at a high rate since it is performed with the hardware shown in
The time division between the concurrent users does not need to be equal. For example, the time period for some users could be longer (more clock cycles) than for others. Also, the turns for some users could be more frequent than for others. For example, if there are 5 users (e.g. Users A, B, C, D and E), the cycle could be A→B→C→D→E→A→B→C→D→E as suggested in
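The unequal time division described above can be sketched as a weighted round-robin schedule, where each user's weight is its relative number of turns per cycle. The weights and user names below are illustrative assumptions.

```python
# Sketch of unequal time-division scheduling: each user has a weight
# (relative turns per cycle); one cycle is expanded round-robin in
# proportion to the weights and then repeated.

def schedule(weights, num_turns):
    # weights: list of (user, turns_per_cycle) pairs.
    cycle = []
    remaining = dict(weights)
    while any(remaining.values()):
        for user, _ in weights:
            if remaining[user] > 0:
                cycle.append(user)
                remaining[user] -= 1
    return [cycle[i % len(cycle)] for i in range(num_turns)]

# User A gets two turns per cycle; B and C get one each.
print(schedule([("A", 2), ("B", 1), ("C", 1)], 8))  # -> A, B, C, A, A, B, C, A
```

Equal weights reproduce the plain A→B→C→D→E rotation; larger or more frequent slices for one user fall out of the same mechanism.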
The space and time division virtualization can be a combination of the space division virtualization and the time division virtualization. That is, a first group of users may take turns using a first dedicated set of components in the MXU, SCU and UB, a second group of users may take turns using a second dedicated set of components in the MXU, SCU and UB, and so on. The users in one group would take turns by time (i.e., time division), with each user's data being stored in their dedicated memory at the end of their turn and then re-loaded into the MXU, SCU and UB when their next turn begins.
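The combination can be sketched as a two-level plan: the pool of processing elements is split into fixed groups (space division), and each group runs its own round-robin of users (time division). The group sizes, slot count and user names below are illustrative assumptions.

```python
# Sketch of hybrid space-time division: PEs are partitioned into groups,
# and each group independently rotates through its own set of users.
# All group sizes and user assignments here are illustrative.

def hybrid_plan(groups, num_slots):
    # groups: dict of group_name -> (num_pes, [users sharing that group in turn])
    plan = {}
    for name, (num_pes, users) in groups.items():
        # Simple equal round-robin within each group.
        turns = [users[i % len(users)] for i in range(num_slots)]
        plan[name] = {"pes": num_pes, "turns": turns}
    return plan

plan = hybrid_plan({"group0": (192, ["A", "B"]),      # unequal split of 256 PEs
                    "group1": (64,  ["C", "D", "E"])}, num_slots=6)
print(plan["group0"]["turns"])   # -> ['A', 'B', 'A', 'B', 'A', 'B']
```

Note the spatial split need not be equal (192/64 here), matching the embodiment in which a CCE is divided into two unequal groups.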
The AI accelerator hardware, along with its configurable features, allows space division virtualization, time division virtualization, as well as a combination of space division virtualization and time division virtualization (referred to as hybrid time-space division virtualization). The hardware architecture as shown in
As mentioned before, the command sequencer, together with the configuration control, plays an important role in coordinating the overall operations of the computing core. The command sequencer sends commands to various blocks to either move data around or to start computation for a neural network. The command sequencer and the configuration control are collectively referred to as control circuitry in this disclosure. The hybrid time-space division virtualization as disclosed herein allows dynamic job allocation for multiple users.
In various embodiments disclosed herein, a single component may be replaced by multiple components and multiple components may be replaced by a single component to perform a given function or functions. Except where such substitution would not be operative, such substitution is within the intended scope of the embodiments.
While various embodiments have been described herein, it should be apparent that various modifications, alterations, and adaptations to those embodiments may occur to persons skilled in the art with attainment of at least some of the advantages. The disclosed embodiments are therefore intended to include all such modifications, alterations, and adaptations without departing from the scope of the embodiments as set forth herein.
The above description is presented to enable a person of ordinary skill in the art to practice the present invention as provided in the context of a particular application and its requirements. The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed. In the above detailed description, various specific details are illustrated in order to provide a thorough understanding of the present invention. Nevertheless, it will be understood by those skilled in the art that the present invention may be practiced without such specific details.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), field programmable gate arrays (FPGAs), and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The software or firmware code may be developed in different programming languages and in different formats or styles, and may be compiled for different target platforms. However, different code formats, styles and languages, and other means of configuring code to perform the tasks in accordance with the invention, will not depart from the spirit and scope of the invention.
This application claims the benefit of U.S. Non-Provisional application Ser. No. 15/956,988, filed Apr. 19, 2018, which claims priority to U.S. Provisional Application No. 62/639,451, filed Mar. 6, 2018. This application also claims the benefit of U.S. Non-Provisional application Ser. No. 16/116,029, filed Aug. 29, 2018, U.S. Provisional Application No. 62/640,804, filed Mar. 9, 2018 and U.S. Provisional Application No. 62/654,761, filed Apr. 9, 2018. The U.S. Non-Provisional application and U.S. Provisional applications are incorporated by reference herein.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2019/020074 | 2/28/2019 | WO | 00 |

| Number | Date | Country |
|---|---|---|
| 62639451 | Mar 2018 | US |
| 62640804 | Mar 2018 | US |
| 62654761 | Apr 2018 | US |

| | Number | Date | Country |
|---|---|---|---|
| Parent | 15956988 | Apr 2018 | US |
| Child | 16975685 | | US |
| Parent | 16116029 | Aug 2018 | US |
| Child | 15956988 | | US |