The present disclosure relates in general to novel configurations of trainable resistive crosspoint devices, which are referred to herein as resistive processing units (RPUs). More specifically, the present disclosure relates to RPU scalable execution.
A method is provided for forming a resistive processing unit (RPU) system. The method includes forming a plurality of RPU tiles, and forming a plurality of RPU chips from the plurality of RPU tiles. The method further includes forming a plurality of RPU compute nodes from the plurality of RPU chips; and connecting the plurality of RPU compute nodes by a high-speed, low-latency network, forming a plurality of RPU supernodes.
An RPU system is provided. The system includes a plurality of RPU tiles and a plurality of RPU chips, whereby each RPU chip comprises the plurality of RPU tiles. The RPU system further includes a plurality of RPU compute nodes, each RPU compute node having a plurality of RPU chips; and a plurality of RPU supernodes, each RPU supernode being a collection of RPU compute nodes, wherein the collection of RPU compute nodes is connected by a high-speed, low-latency network.
A computer program product for training an RPU system is provided. The computer program product includes a computer-readable storage medium having computer-readable program code embodied therewith. When executed, the computer-readable program code causes the computer to receive, at an input layer, an activation value from an external source, compute a vector-matrix multiplication, and perform a non-linear activation on the resulting vector. Based on reaching a last input layer, the computer backpropagates the error and updates a weight matrix.
Additional features and advantages are realized through techniques described herein. Other embodiments and aspects are described in detail herein. For a better understanding, refer to the description and to the drawings.
The subject matter which is regarded as embodiments is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
Training artificial neural networks (ANN) is computationally intensive, even when executing in distributed multi-node parallel computing architectures. Current implementations attempt to accelerate the computing power available to the training by packing larger numbers of computing units, such as GPUs and FPGAs, into a fixed area and power budget. However, these are digital approaches that use a similar underlying technology. Therefore, acceleration factors will eventually reach a limit due to the scaling limits of that technology.
Instead of utilizing the traditional digital model of manipulating zeros and ones, ANNs create connections between processing elements that are substantially the functional equivalent of the physical neural network that is being approximated. For example, a physical neural network can include several neurons that are connected to each other by synapses. The RPU chip approximates this physical construct by being a configuration of several RPU tiles. Each RPU tile is a crossbar array formed of a set of conductive row wires and a set of conductive column wires formed to intersect the set of conductive row wires. The intersections can be considered analogous to synapses, where the row and column wires may be analogous to the neuron connections. Each intersection is an active region that effects a non-linear change in a conduction state of the active region. The active region is configured to locally perform a data storage operation of a training methodology based at least in part on the non-linear change in the conduction state. The active region is further configured to locally perform a data processing operation of the training methodology based at least in part on the non-linear change in the conduction state.
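By way of example and not limitation, the analog multiply-accumulate performed by one crossbar array can be modeled in software with the following minimal sketch; the array size, the conductance and voltage values, and the function name are assumptions made for this illustration and are not part of the disclosed hardware.

```python
import numpy as np

# Illustrative software model of the analog vector-matrix multiply performed
# by one crossbar array: weights are stored as conductances at the row/column
# intersections, inputs arrive as voltages on the row wires, and each column
# wire sums the resulting currents (Ohm's law plus Kirchhoff's current law).
# Array size and values are assumptions made for this example.

def crossbar_forward(conductances: np.ndarray, voltages: np.ndarray) -> np.ndarray:
    """Return the column currents I = G^T v, where conductances[i, j] sits at
    the intersection of row wire i and column wire j."""
    return conductances.T @ voltages

rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, size=(4, 3))   # 4 row wires x 3 column wires
v = rng.uniform(-1.0, 1.0, size=4)       # input voltages on the row wires
print(crossbar_forward(G, v))            # one multiply-accumulate per column wire
```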
The RPU tiles are configured together through physical connections, such as cabling, and under the control of firmware, as an RPU chip. On-chip network routers perform communications among the individual RPU tiles.
Each array element on the RPU tile receives a variety of analog inputs, in the form of voltages. Based on prior learning, i.e., previous computations, the RPU tile uses a non-linear function to determine the result to pass along to the next set of compute elements. RPU tiles are configured into RPU chips, which can provide improved performance with less power consumption because both data storage and computations are performed locally on the RPU chip. The vector computation results are passed through the RPU tiles on the RPU chips, but not the weights. Additionally, in contrast to traditional digital CPU-based computing, RPU chips are analog resistive devices, meaning computations can be performed without converting data from analog to digital and without moving the data between the CPU and computer memory. Because of these characteristics, computations on the RPU tiles and RPU chips are asynchronous and execute in parallel at each layer.
Each RPU tile 140 includes neural elements that can be arranged in an array, for example in a 4,096-by-4,096 array. The RPU tile 140 executes the three atomic matrix operations of the forward cycle, backward cycle, and matrix update.
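Purely for illustration, the three atomic operations can be summarized in software as follows; the class name, tile dimensions, and learning rate are hypothetical assumptions made for this sketch rather than details of the RPU tile 140 itself.

```python
import numpy as np

# Hypothetical software model of one RPU tile and its three atomic matrix
# operations: forward cycle (y = W x), backward cycle (z = W^T delta), and
# weight update (W <- W + eta * delta x^T). Class name, dimensions, and
# learning rate are illustrative assumptions.

class RPUTile:
    def __init__(self, n_inputs: int, n_outputs: int, lr: float = 0.01):
        self.W = np.zeros((n_outputs, n_inputs))   # weights remain local to the tile
        self.lr = lr

    def forward(self, x: np.ndarray) -> np.ndarray:
        return self.W @ x                           # forward cycle: y = W x

    def backward(self, delta: np.ndarray) -> np.ndarray:
        return self.W.T @ delta                     # backward cycle: z = W^T delta

    def update(self, x: np.ndarray, delta: np.ndarray) -> None:
        self.W += self.lr * np.outer(delta, x)      # update: outer product of the two vectors
```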
The I/O connections 110 communicate to other hardware components in the cluster, including other RPU chips, to return results, ingest training data, and generally to provide connectivity to other hardware in the configuration.
The NoC 130 moves data between the RPU tiles 140 and the NLFs 120 for linear and non-linear transformations. However, only the neuron data, i.e., the vectors, moves; the weight data 150 remains local to the RPU tile 140.
ANNs are composed of multiple stacked layers (convolutional, fully connected, recurrent, etc.) such that the signal propagates from the input layer to the output layer by going through transformations by the NLFs 120. For each input and output layer, the NLFs 120 transmit the result vector from the array into the RPU tile 140, and return the result vector from the RPU tile 140. The choice of NLF 120, for example softmax or sigmoid, depends on the requirements of the model being trained. The ANN expresses a single differentiable error function that maps the input data onto class scores at the output layer. Most commonly, the neural network is trained with simple stochastic gradient descent (SGD), in which the error gradient with respect to each parameter is calculated using the backpropagation algorithm. The backpropagation algorithm is composed of three cycles, forward, backward, and weight update, that are repeated until a convergence criterion is met. Once the information reaches the final output layer, the error signal is calculated and backpropagated through the neural network. Finally, in the update cycle the weight matrix is updated by performing an outer product of the two vectors that are used in the forward and the backward cycles.
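By way of illustration only, one pass of the three cycles can be sketched as follows, reusing the hypothetical RPUTile model sketched above; the softmax NLF, the output-layer error term (target minus output), and the returned error measure are assumptions made for this example, and derivatives of the NLF are omitted to stay aligned with the simplified backward-cycle description.

```python
import numpy as np

# Sketch of one pass of the three backpropagation cycles, reusing the
# hypothetical RPUTile model above. Softmax NLF, output-layer error, and
# error measure are illustrative assumptions.

def softmax(y: np.ndarray) -> np.ndarray:
    e = np.exp(y - y.max())
    return e / e.sum()

def train_step(tiles, x, target):
    # Forward cycle: each layer computes y = W x and applies its NLF.
    activations = [x]
    for tile in tiles:
        activations.append(softmax(tile.forward(activations[-1])))

    # Error signal at the output layer, then backward cycle z = W^T delta.
    deltas = [target - activations[-1]]
    for tile in reversed(tiles[1:]):
        deltas.insert(0, tile.backward(deltas[0]))

    # Update cycle: outer product of the forward and backward vectors.
    for tile, a, d in zip(tiles, activations[:-1], deltas):
        tile.update(a, d)

    return float(np.sum(deltas[-1] ** 2))   # error measure for a convergence check
```

A training loop would simply call train_step repeatedly until the returned error measure satisfies a chosen convergence criterion.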
A compute node, such as the RPU compute node 210, includes several RPU chips 200. The CPUs (or GPUs) execute computer support functions. For example, the operating system manages and controls traditional hardware components in the RPU compute node 210, and is enhanced with firmware that also controls the RPU chips 200 and RPU-related hardware. The RPU compute node 210 also includes an RPU-aware software stack that includes a runtime for resource management, workload scheduling, and power/performance tuning. An RPU-aware compiler generates RPU instruction set architecture (ISA)-specific executable code. An application that exploits the RPU hardware can include various RPU APIs and RPU ISA-specific instructions. However, the application can include traditional non-RPU APIs and instructions, and the RPU-aware compiler can generate both RPU and non-RPU executable code.
The RPU SuperNode 220 is a collection of RPU compute nodes 210 that are connected using a high-speed, low-latency network, for example, InfiniBand.
The RPU system 230 illustrates only one of several possible RPU hardware configurations. As shown, symmetry is not required in an RPU system 230, which can be an unbalanced tree. The number and type of RPU hardware components in the configuration depend upon the requirements of the ANN model being trained. Some, or all, of the nodes in an RPU system 230 can be physical hardware and software. The RPU compute nodes 210 of the RPU system 230 can include virtualized hardware and software that simulate the operation of the physical hardware and software. Whether physical, virtualized, or a combination of the two, the RPU compute nodes 210 may be operated and controlled by clustering software that is specialized to coordinate and control the operation of multiple computing nodes.
Various configurations of the hardware components shown in
Each input layer node 302, 304, 306 of ANN 300 receives inputs x1, x2, x3 directly from a source (not shown) with no connection strength adjustments and no node summations. Accordingly, y1=f(x1), y2=f(x2) and y3=f(x3), as shown by the equations listed at the bottom of
ANN model 300 learns by comparing an initially arbitrary classification of an input data record with the known actual classification of the record. Using a training methodology known as backpropagation (i.e., backward propagation of errors), the errors from the initial classification of the first input data record are fed back into the network and are used to modify the network's weighted connections the second time around. This feedback process continues for several iterations. In other words, the newly calculated values become the new input values that feed the next layer. This process continues until it has gone through all the layers and determined the output. In the training phase of an ANN, the correct classification for each record is known, and the output nodes can therefore be assigned correct values, for example, a node value of “1” (or 0.9) for the node corresponding to the correct class, and a node value of “0” (or 0.1) for the others. It is thus possible to compare the network's calculated values for the output nodes to these correct values, and to calculate an error term for each node (i.e., the delta rule). These error terms are then used to adjust the weights in the hidden layers so that in the next iteration the output values will be closer to the correct values.
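For illustration, the following sketch computes the error terms for a single record using the 0.9/0.1 target encoding described above; the output values and the index of the correct class are made up for this example.

```python
import numpy as np

# Illustrative delta-rule computation for a single training record: the node
# for the correct class is assigned 0.9 and the other output nodes 0.1, and
# the per-node error terms are the differences between these target values
# and the network's calculated output values. All numbers are made up.

calculated = np.array([0.30, 0.55, 0.15])   # network output for one record
target = np.full(3, 0.1)
target[1] = 0.9                              # class 1 is the known correct class
error_terms = target - calculated            # one error term per output node
print(error_terms)                           # drives the hidden-layer weight adjustments
```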
As shown in 500, the non-linear function, softmax, is used to train the ANN model. The “P” values represent weights for each layer, and x1 represents the activation value input to the calculation at the first layer. The first layer of the forward cycle, 505, computes a vector-matrix multiplication (y=Wx) where the vector x represents the activities of the input neurons and the matrix W stores the weight values between each pair of input and output neurons.
In the example at 505, the softmax NLF operates on the local weight matrix 1P to output a result vector 1F1. In the next layer 510, vector 1F1 becomes the input to the softmax NLF, which operates on that layer's local weight matrix 2P to output a result vector 2F1. Finally, in the last layer 515, vector 2F1 becomes the input to the softmax NLF, which operates on that layer's local weight matrix 3P to output a result vector 3F1.
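By way of example and not limitation, the three-layer forward cycle at 505, 510, and 515 can be sketched as follows; the matrix sizes and values are assumptions, and the variables standing in for 1P, 2P, 3P, x1, and the result vectors are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Illustrative sketch of the three-layer forward cycle at 505, 510, and 515:
# each layer applies the softmax NLF to the product of its local weight
# matrix (1P, 2P, 3P) and the vector produced by the previous layer.
# Matrix sizes and values are assumptions made for this example.

def softmax(y: np.ndarray) -> np.ndarray:
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(1)
P1, P2, P3 = (rng.standard_normal((4, 4)) for _ in range(3))  # local weight matrices 1P, 2P, 3P
x1 = rng.standard_normal(4)                                   # activation input at the first layer

F1 = softmax(P1 @ x1)   # 505: result vector 1F1
F2 = softmax(P2 @ F1)   # 510: result vector 2F1
F3 = softmax(P3 @ F2)   # 515: result vector 3F1
```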
Following the calculation of the final output layer 515, the error signal is calculated and backpropagated through the network. The backward cycle on a single layer also involves a vector-matrix multiplication on the transpose of the weight matrix (z = W^T δ), where the vector δ represents the error calculated by the output neurons. Finally, in the update cycle the weight matrix W is updated by performing an outer product of the two vectors that are used in the forward and the backward cycles, usually expressed as W ← W + η(δx^T), where η is a global learning rate.
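For illustration, the backward and update cycles on a single layer can be sketched as follows; the sizes, the error vector, and the learning rate are assumptions made for this example.

```python
import numpy as np

# Illustrative sketch of the backward and update cycles on a single layer,
# following z = W^T delta and W <- W + eta * (delta x^T) above. The sizes,
# the error vector, and the learning rate are assumptions for this example.

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 4))     # weight matrix of the layer
x = rng.standard_normal(4)          # vector used in the forward cycle
delta = rng.standard_normal(3)      # error vector from the output neurons
eta = 0.01                          # global learning rate

z = W.T @ delta                     # backward cycle on the transposed weight matrix
W = W + eta * np.outer(delta, x)    # update cycle: outer product of the two vectors
```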
Each operation 505, 510, 515 can occur in a pipelined, parallel fashion, thereby fully utilizing the RPU hardware in all three cycles of the training algorithm.
As shown in
At 605, the RPU tile 140 receives from an outside source an activation input value, e.g., x1, at an input layer. At 610, the RPU tile 140 computes a vector-matrix multiplication, where the vector represents the activities of the input neurons and the weight matrix W stores the weight values between each pair of input and output neurons. Storing the weight matrix locally allows computations that are pipeline-parallel and asynchronous. At 615, non-linear activation is performed on each element of the resulting vector y, and the resulting vector is passed to the next layer (620). If the current layer is not the last input layer (625), then the resulting vector of the current layer, here 1F1 of 505, is passed as input to the next layer (630). At the next layer, the computation is repeated using the weight matrix (2P) that is stored on the RPU tile 140 locally to the layer. The process returns to 615 and is repeated for each input layer.
If, at 625, the last layer is reached (e.g., 515 of
When the last output layer is reached, the weight matrix W is updated by performing an outer product of the two vectors that are used in the forward and the backward cycles, as shown in the update column 555 of
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be accomplished as one step, executed concurrently, substantially concurrently, in a partially or wholly temporally overlapping manner, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.