This application claims foreign priority to European Patent Application No. 17210864.9, filed Dec. 28, 2017, the content of which is incorporated by reference herein in its entirety.
The disclosed technology relates to the field of integrated machine learning neuromorphic computing and neural networks, more particularly to hardware implementation of multi-layer perceptrons. In particular, the disclosed technology relates to a semiconductor cell for performing dot-product operations between a first and a second operand, an array of such semiconductor cells, and to a neural network comprising such an array or arrays.
Neural networks (NNs) are classification techniques used in the machine learning domain. Typical examples of such classifiers include Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs).
Neural network (NN) architectures comprise layers of “neurons” (which are basically multiply-accumulate units), weights that interconnect those neurons, and particular layers used for various operations, such as normalization or pooling.
The computation involved in training or running these classifiers has been facilitated using Graphics Processing Units (GPUs) or custom Application-Specific Integrated Circuits (ASICs), for which dedicated software flows may be utilized.
Many software approaches have advocated the use of NNs (either MLPs or CNNs) with binary weights and activations, showing minimal accuracy degradation on state-of-the-art classification benchmarks. The goal of such approaches is to enable neural network GPU kernels with a smaller memory footprint and higher performance, given that the data structures exchanged from/to the GPU are aggressively reduced. However, none of the known approaches can overcome the high energy cost that is involved in each classification run on a GPU, especially the leakage energy component related solely to the storage of the NN weights. A benefit of restricting weights and activations to two possible values each (either +1 or −1) is that the multiply-accumulate operation (i.e., dot-product) that is typically encountered in NNs boils down to a popcount of element-wise XNOR or XOR operations.
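By way of illustration only (and not as part of the claimed subject-matter), the following Python sketch shows this reduction; the packing of −1 as bit 0 and +1 as bit 1 is an encoding assumed for the example.

```python
# Illustrative sketch: binary dot-product as popcount of XNOR.
# Encoding assumption: -1 -> bit 0, +1 -> bit 1.

def dot_product_xnor(w_bits: int, x_bits: int, n: int) -> int:
    """Dot-product of two n-element {-1,+1} vectors packed into integers."""
    mask = (1 << n) - 1
    xnor = ~(w_bits ^ x_bits) & mask    # 1 wherever the elements agree
    matches = bin(xnor).count("1")      # popcount
    # Each match contributes +1, each mismatch -1:
    return 2 * matches - n

# Reference computation on unpacked vectors, for comparison.
w = [+1, -1, -1, +1, +1]
x = [+1, +1, -1, -1, +1]
reference = sum(wi * xi for wi, xi in zip(w, x))

pack = lambda v: sum((1 << i) for i, vi in enumerate(v) if vi == +1)
assert dot_product_xnor(pack(w), pack(x), len(w)) == reference  # both give 1
```

On hardware with a native popcount, the inner loop of a binary dot-product thus reduces to one XNOR and one popcount per machine word, which is the property the binarized kernels mentioned above exploit.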
A dot-product or scalar product is an algebraic operation that takes two equal-length sequences of numbers and returns a single number. The dot-product is very frequently used as a basic mathematical NN operation. At least in the inference phase (i.e., not during training), a wide range of machine learning implementations (e.g., MLPs or CNNs) can be decomposed into layers of dot-product operators, interleaved with simple arithmetic operations. Most of these implementations pertain to the classification of raw data (e.g., the assignment of a label to a raw data frame).
Dot-product operations are typically performed between values that depend on the NN input (e.g., a frame to be classified) and constant operands. The input-dependent operands are sometimes referred to as “activations”. For the case of MLPs, the constant operands are the weights that interconnect two MLP layers. For the case of CNNs, the constant operands are the filters that are convolved with the input activations or the weights of the final fully connected layer. A similar thing can be said for the simple arithmetic operations that are interleaved with the dot-products in the classifier: for example, normalization is a mathematical operation between the outputs of a hidden layer and constant terms that are fixed after training of the classifier.
Dot-product operations, and therefore also neuromorphic applications, are read dominated. In terms of energy, this means that read energy outweighs write energy. Reducing read energy is therefore becoming an unavoidable concern for deep neural networks such as binary neural networks (BNNs).
It is an object of the disclosed technology to reduce energy requirements of classification operations.
The above objective is accomplished by a semiconductor cell, an array of semiconductor cells and a method of using at least one array of semiconductor cells in a neural network, according to embodiments of the disclosed technology.
In a first aspect, a semiconductor cell is provided, comprising a memory element for storing a first binary operand, where the memory element provides complementary memory outputs, and a multiplication block that is locally and uniquely associated with the memory element. The multiplication block is configured for receiving complementary input signals representing binary input data and the complementary memory outputs of the associated memory element representing the first binary operand, and for implementing a multiplication operation on these signals, and for providing an output of the multiplication operation to an output port.
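Purely to fix ideas, the behavior of such a cell can be sketched in software as follows; the class and method names are illustrative assumptions and do not correspond to any particular circuit-level implementation.

```python
class SemiconductorCellModel:
    """Behavioral sketch of the cell: a 1-bit memory element with a
    locally associated multiplication (XNOR) block. Names are illustrative."""

    def __init__(self) -> None:
        self.q = 0  # stored first operand (binary weight)

    def write(self, w: int) -> None:
        """Store the first binary operand; its complement is implicit."""
        self.q = w & 1

    def multiply(self, x: int) -> int:
        """XNOR of the stored operand with the binary input activation,
        i.e. the binary equivalent of a {-1,+1} multiplication."""
        q, qbar = self.q, self.q ^ 1            # complementary memory outputs
        x, xbar = x & 1, (x & 1) ^ 1            # complementary input signals
        return (q & x) | (qbar & xbar)          # XNOR built from the complements

cell = SemiconductorCellModel()
cell.write(1)
assert cell.multiply(1) == 1 and cell.multiply(0) == 0
```

Note how the multiplication uses both the true and the complementary signals, mirroring the complementary memory outputs and complementary inputs of the cell described above.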
In a semiconductor cell according to embodiments of the disclosed technology, the multiplication block may be adapted to perform an XNOR or XOR logic function between the input data and the stored first binary operand.
A semiconductor cell according to embodiments of the disclosed technology may furthermore comprise a select switch for controlling provision of the output of the multiplication operation to an external circuit.
In a semiconductor cell according to embodiments of the disclosed technology, the memory element may be implemented in an SRAM implementation. In such embodiments, a binary weight may be stored as the first operand in cross-coupled inverters of the SRAM implementation.
In a semiconductor cell according to embodiments of the disclosed technology, the memory element may furthermore comprise at least one input for receiving the first binary operand from a data line and at least one access switch connecting the at least one input to a memory unit of the memory cell, the at least one access switch being adapted for being driven by a word line for passing the first binary operand to the memory unit. Such a semiconductor cell may have two access switches connecting two inputs to a memory unit, for providing complementary data of the first binary operand to the memory unit.
In a second aspect, the disclosed technology provides an array of semiconductor cells according to any of the embodiments of the first aspect, logically arranged in rows and columns.
An array according to embodiments of the second aspect may furthermore comprise word lines along the rows of the array and bit lines along the columns thereof, whereby the crossing of a set of word lines and bit lines uniquely identifies a location of a semiconductor cell in the array.
An array according to embodiments of the present invention may comprise word lines configured for delivering complementary input activations to input ports of the semiconductor cells, and read bit lines configured for receiving the outputs of the multiplication operations from the readout ports of the semiconductor cells in the array connected to that read bit line.
In a third aspect, the disclosed technology provides a neural network circuit comprising at least one array of semiconductor cells according to any of the embodiments of the second aspect; and a plurality of sensing units. A sensing unit (SU) is shared between different semiconductor cells of at least one column of the at least one array, for reading the outputs of the multiplication blocks of the shared semiconductor cells. The sharing of the sensing unit between different semiconductor cells of at least one column of the at least one array implements a time multiplexing operation. The neural network furthermore comprises a plurality of accumulation units, each accumulation unit arranged to sequentially accumulate the outputs of a particular sensing unit corresponding to sequentially selected semiconductor cells of the shared semiconductor cells.
A neural network circuit according to embodiments of the disclosed technology may furthermore comprise a plurality of post-processing units for further processing of the output signals of the accumulation units.
In a neural network circuit according to embodiments of the disclosed technology, at least two semiconductor cells that are sharing a single sensing unit may be grouped into an enlarged semiconductor unit, whereby the output ports of the at least two semiconductor cells are connected to a switch element, the output of the switch element being connected to the single sensing unit. The switch element may, in some embodiments, be adapted for allowing two multiplications and a single accumulation.
In such a neural network circuit, the switch element may be adapted for allowing multi-bit accumulation of the multiplication result of the at least two semiconductor cells grouped into the enlarged semiconductor unit. The accumulation may in some embodiments be achieved by using a high-impedance pre-charged SU, and then taking the outputs of the SU at a specific time.
In particular embodiments, two semiconductor cells may be grouped into the enlarged semiconductor unit, and the switch element may be adapted for allowing two-bit accumulation for simultaneous readout of the two semiconductor cells grouped into the enlarged semiconductor unit. The switch element may comprise a first transistor with a first control electrode and a first and second main electrode, and a second transistor with a second control electrode and a third and fourth main electrode. In the particular implementation where the transistors are MOS transistors, a control electrode may be a gate of a transistor and a main electrode may be a source or drain of a transistor. The first and third main electrodes are coupled together to a first reference voltage, and the second and fourth main electrodes are coupled together, potentially through a multiplexing switch, to the sensing unit. The first reference voltage should be a low-impedance voltage source. It can be ground for an NMOS implementation of the transistors, or the supply voltage for a PMOS implementation. However, the disclosed technology is not limited thereto, and the first reference voltage could be other voltages as well, provided they suit the operation of the SU in distinguishing the states that need to be detected.
In the neural network circuit, an output signal of a first semiconductor cell of the at least two grouped semiconductor cells is coupled to the first control electrode, and an output of a second semiconductor cell of the at least two grouped semiconductor cells is coupled to the second control electrode. In particular embodiments, the switch element may furthermore comprise a third transistor with a third control electrode and a fifth and sixth main electrode, and a fourth transistor with a fourth control electrode and a seventh and eighth main electrode, coupled in series whereby the sixth main electrode is connected to the seventh main electrode, the fifth main electrode is coupled with the first and third main electrodes, and the eighth main electrode is coupled with the second and fourth main electrodes, the output of the first semiconductor cell being coupled to the third control electrode, and the output of the second semiconductor cell being coupled to the fourth control electrode.
In embodiments of the disclosed technology, two activations are read simultaneously and are sensed as one cell. This reduces the read energy consumption by roughly half.
In a further aspect, the disclosed technology provides the use of a neural network according to embodiments of the third aspect of the disclosed technology for performing a clustering, classification or pattern recognition task. The neural network receives inputs from the external world in the form of patterns and images in vector form. Each input is multiplied by its corresponding weight in a semiconductor cell according to embodiments of the disclosed technology. Weights are the information used by the neural network to solve a problem. Typically, weights represent the strength of the interconnections between neurons inside the neural network. The weighted inputs are sensed and accumulated, and potentially limited to fall within a desired range (normalized). The neural network may also be used for prediction, such as for processing or predicting the transition from a first frame to a second frame, based on a sequence of input frames that has been fed to the system.
It is an advantage of embodiments of the disclosed technology that a hardware based solution is provided to reduce energy consumption. Furthermore, the same hardware based solution reduces read delay.
Particular aspects of the disclosed technology are set out in the accompanying independent and dependent claims. Features from the dependent claims may be combined with features of the independent claims and with features of other dependent claims as appropriate and not merely as explicitly set out in the claims.
For purposes of summarizing the disclosed technology and the advantages achieved over the prior art, certain objects and advantages of the disclosed technology have been described herein above. Of course, it is to be understood that not necessarily all such objects or advantages may be achieved in accordance with any particular embodiment of the disclosed technology. Thus, for example, those skilled in the art will recognize that the disclosed technology may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other objects or advantages as may be taught or suggested herein.
The above and other aspects of the disclosed technology will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
The disclosed technology will now be described further, by way of example, with reference to the accompanying drawings, in which:
The drawings are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn on scale for illustrative purposes. The dimensions and the relative dimensions do not necessarily correspond to actual reductions to practice of the invention.
Any reference signs in the claims shall not be construed as limiting the scope.
In the different drawings, the same reference signs refer to the same or analogous elements.
The disclosed technology will be described with respect to particular embodiments and with reference to certain drawings but the disclosed technology is not limited thereto but only by the claims.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequence, either temporally, spatially, in ranking or in any other manner. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the disclosed technology described herein are capable of operation in other sequences than described or illustrated herein.
Moreover, directional terminology such as top, bottom, front, back, leading, trailing, under, over and the like in the description and the claims is used for descriptive purposes with reference to the orientation of the drawings being described, and not necessarily for describing relative positions. Because components of embodiments of the disclosed technology can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration only, and is in no way intended to be limiting, unless otherwise indicated. It is, hence, to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the disclosed technology described herein are capable of operation in other orientations than described or illustrated herein.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosed technology. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this disclosure are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.
Similarly, it should be appreciated that in the description of exemplary embodiments of the disclosed technology, various features of the disclosed technology are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the disclosed technology requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects may lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the disclosed technology.
Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosed technology, and form different embodiments, as would be understood by those in the art. For example, in the claims, any of the claimed embodiments can be used in any combination.
It should be noted that the use of particular terminology when describing certain features or aspects of the disclosed technology should not be taken to imply that the terminology is being re-defined herein to be restricted to include any specific characteristics of the features or aspects of the disclosed technology with which that terminology is associated.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosed technology may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In embodiments of the present invention, semiconductor cells are logically organized in rows and columns. Throughout this description, the terms “horizontal” and “vertical” (related to the terms “row” and “column”, respectively) are used to provide a co-ordinate system and for ease of explanation only. They do not need to, but may, refer to an actual physical direction of the device. Furthermore, the terms “column” and “row” are used to describe sets of array elements, in particular in the disclosed technology semiconductor cells, which are linked together. The linking can be in the form of a Cartesian array of rows and columns; however, the disclosed technology is not limited thereto. As will be understood by those skilled in the art, columns and rows can be easily interchanged and it is intended in this disclosure that these terms be interchangeable. Also, non-Cartesian arrays may be constructed and are included within the scope of the disclosed technology. Accordingly, the terms “row” and “column” should be interpreted widely. To facilitate this wide interpretation, the claims refer to cells being logically organized in rows and columns. By this is meant that sets of semiconductor cells are linked together in a topologically linear intersecting manner; however, the physical or topographical arrangement need not be so. For example, the rows may be circles and the columns radii of these circles, and the circles and radii are described in the disclosure as “logically organized” rows and columns. Also, specific names of the various lines (e.g., word line and bit line) are intended to be generic names used to facilitate the explanation and to refer to a particular function, and this specific choice of words is not intended to in any way limit the disclosed technology. It should be understood that all these terms are used only to facilitate a better understanding of the specific structure being described, and are in no way intended to limit the disclosed technology.
For the technical description of embodiments of the disclosed technology, the design enablement of a multi-layer perceptron (MLP) with binary weights and activations is used as an illustrative example. A similar description is valid, but not written out in detail, for convolutional neural networks (CNNs), with the appropriate reordering of logic units and the designation of the memory unit as storing binary filter values, instead of binary weight values.
Artificial neural networks are computing systems inspired by the biological neural networks that constitute human and animal brains. Such systems learn to do tasks by considering examples, generally without task-specific programming.
Typically, neurons 11 are organized in layers. Neurons 11 of different layers may perform different kinds of transformations on their inputs.
The input layer 131 may have a first number Nin of neurons 11, and may hence accept the first number Nin of inputs. There may be a second number Ni of neurons 11 per intermediate layer 132, 133, 134, with Ni dependent on the intermediate layer and on the application. The output layer 135 may have a third number Nout of neurons 11. For training, Nin, Ni and Nout can be any number. For testing or classification, Nout should be smaller than Nin (Nout<Nin). The neural network 10 is dimensioned in terms of N (maximum number of neurons in any of the layers) and L (number of layers).
Neurons may have a state, generally represented by a real number, typically between 0 and 1. In particular implementations, these states are weights that vary as learning proceeds, and that can increase or decrease the strength of the signal that a neuron sends downstream.
In the particular example of Binary Neural Networks (or Binary MLPs), first operands in the form of weights w are stored in the neurons 11, and second operands in the form of input activations x are received by the neurons. Both may be confined to the [−1, +1] interval. During training, the weights w and the input activations x are scalar values (w, x ∈ [−1, +1]). During testing, the weights w and the input activations x may be binary values (w, x ∈ {−1, +1}).
Each neuron 11 comprises a calculation part and a normalization and non-linearity part.
In the example of a BNN, the calculation part processes incoming input activations x and locally stored weights w, so as to obtain y_k = Σ_{j=0}^{N−1} x_j·w_{kj}, with k indexing the neuron in the next layer. This operation is called a dot-product operation. Evaluation of the k-th neuron in a subsequent layer is thus the dot-product of the N inputs x_0, …, x_{N−1} with the weights w_{k0}, …, w_{k(N−1)}. Each neuron in a subsequent layer has the same inputs, but the weights will be different.
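For concreteness, a minimal Python sketch of this per-layer evaluation is given below; the data values are arbitrary examples and the function name is an illustrative assumption.

```python
# Illustrative evaluation of y_k = sum_{j=0}^{N-1} x_j * w_kj for each
# neuron k of a next layer: every neuron sees the same inputs x but
# applies its own row of weights w[k].

def layer_dot_products(x: list[int], w: list[list[int]]) -> list[int]:
    """Return [y_0, y_1, ...] for the given activations and weight rows."""
    return [sum(xj * wkj for xj, wkj in zip(x, wk)) for wk in w]

x = [+1, -1, +1, +1]                  # shared input activations
w = [[+1, -1, +1, +1],                # weights of neuron k = 0
     [-1, +1, +1, -1]]                # weights of neuron k = 1
print(layer_dot_products(x, w))       # -> [4, -2]
```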
The normalization and non-linearity part may process the obtained output values yk of each neuron as follows, with μ, σ, γ, β normalization parameters obtained from training:
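The normalization equation itself is not written out above. Assuming the standard batch normalization followed by a sign non-linearity that is commonly used in binary networks (an assumption made for illustration, not a statement of the claimed method), the processing would take the form:

```latex
% Assumed standard form: batch normalization of the dot-product output
% y_k, followed by a sign non-linearity producing the next activation.
% \mu, \sigma, \gamma and \beta are fixed after training.
\[
  \hat{y}_k = \gamma \, \frac{y_k - \mu}{\sigma} + \beta ,
  \qquad
  x_k^{\mathrm{next}} = \operatorname{sign}\!\left(\hat{y}_k\right) \in \{-1, +1\}.
\]
```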
If at test time the weight values w and the input activations x are binary values (w, x ∈ {−1, +1}), this corresponds in binary logic to w, x ∈ {0, 1}. As a result, the dot-product operation corresponds to the following truth table (using the encoding −1 → 0, +1 → 1):

x | w | x·w
--|---|----
0 | 0 | 1
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
Hence the dot-product operation (the product between weight w and input activation x), which is the core operation in such neural networks, is actually an XNOR operation. If one of the inputs is swapped in sign, this can be expressed as an XOR operation.
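This equivalence is easy to verify exhaustively, as in the short Python check below; the mapping −1 → 0, +1 → 1 is an assumed encoding used for illustration.

```python
# Exhaustive check that multiplication in the {-1,+1} domain equals
# XNOR in the {0,1} domain, under the mapping -1 -> 0, +1 -> 1.

to_binary = {-1: 0, +1: 1}

for w in (-1, +1):
    for x in (-1, +1):
        product_sign = w * x                       # {-1,+1} domain
        xnor = 1 ^ (to_binary[w] ^ to_binary[x])   # {0,1} domain
        assert to_binary[product_sign] == xnor
print("dot-product core operation == XNOR")
```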
In a first aspect, the disclosed technology relates to a semiconductor cell 20, as illustrated in
The semiconductor cell 20 furthermore comprises a multiplication block 22. The multiplication block 22 is locally and uniquely associated with the memory element 21 of the semiconductor cell 20. The multiplication block 22 has a first input port 221 and a second input port 222, for receiving the complementary memory outputs Q and Qbar from the first and second output ports 211, 212 of the memory element 21, respectively. The multiplication block 22 further has a third input port 223 and a fourth input port 224, for receiving the second operand X and its complement Xbar, respectively. The second operand X is a value fed to the semiconductor cell 20, which may be variable, and which may depend on the current input to the semiconductor cell 20, for instance a frame to be classified. The second operands X are sometimes referred to as “activations” or “input activations”. In particular embodiments of the disclosed technology, where MLPs are involved, the first operand can be one of the weights that interconnect two MLP layers. In alternative embodiments, where CNNs are involved, the first operand can be one of the filters that are convolved with the input activations, or a weight of a final fully connected layer.
The multiplication block 22 is configured for implementing a multiplication operation between the first operand stored in its associated memory element 21 and the second operand received by the semiconductor cell 20. The multiplication is done in place, i.e., within the semiconductor cell 20. The multiplication block 22 has an output port 225 for outputting the result “Out” of the multiplication operation (e.g., a digital output) for instance, for putting this result on a column line.
In a second aspect, a plurality of such semiconductor cells 20 may be arranged in an array 30, whereby the semiconductor cells are logically arranged in rows and columns, as for instance illustrated in
Such array 30 may comprise word lines configured for delivering second operands (input activations x) to input ports of the semiconductor cells 31. The input ports of the semiconductor cells 31 may coincide with or be linked to the third and fourth input ports 223, 224 of the multiplication block. The array 30 may also comprise read bit lines configured for receiving the outputs of the multiplication operation from readout ports of the semiconductor cells 31 connected to that read bit line. The readout port of a semiconductor cell 31 may coincide with or be linked to the output port 225 of the multiplication block 22.
The three vertical bit lines, for this embodiment, are the write bit lines BL and BLbar, and the read bit line RBL.
The read out values are then accumulated in accumulators 36. If so required, the accumulated values may be further processed in post-processing units 37. The further processing may comprise or consist of normalization and/or non-linear operations. The values so obtained per column can be read out and stored for further use, or can be directly used by further circuitry (not illustrated, and not discussed in further detail).
In the illustrated embodiment, the activation signals (Xi and their complements Xbari) are directly fed into the semiconductor cells 31; more particularly, they are put on the word lines WX and WXbar providing input to the multiplication block 22.
In this embodiment, and for the example illustrated, four cycles are needed to read out all multiplication values between the first and the second operands (i.e., one cycle for reading out each row). The read out values are then accumulated per column in accumulators 36, and, if so required, further processed in post-processing units 37. The further processing may comprise or consist of normalization and/or non-linear operations.
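A behavioral Python sketch of this time-multiplexed readout is given below; the array dimensions and the stored values are arbitrary examples, and the sensing and accumulation are idealized.

```python
# Behavioral sketch of the time-multiplexed readout: one row is selected
# per cycle, each column's sensing unit reads one XNOR result, and the
# per-column accumulators sum the results over the cycles.

ROWS, COLS = 4, 3
weights = [[1, 0, 1],                         # stored first operands, per cell
           [0, 0, 1],
           [1, 1, 0],
           [0, 1, 1]]
x = [1, 0, 1, 1]                              # one input activation per row

accumulators = [0] * COLS
for row in range(ROWS):                       # one read cycle per row
    for col in range(COLS):
        out = 1 ^ (weights[row][col] ^ x[row])   # in-cell XNOR result
        accumulators[col] += out                  # sensed, then accumulated
print(accumulators)                           # per-column dot-product counts
```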
A corresponding timing diagram is shown at the left-hand side of
In an alternative embodiment of the third aspect, two activation inputs Xi are enabled simultaneously, as illustrated in
One column 50 of an array 40 according to this embodiment is illustrated in
The connection to word lines and bit lines is illustrated in
A detailed implementation example of semiconductor cells 20 and supplementary circuitry for use in the modified neural network circuit 45, enabling two inputs simultaneously, is illustrated in
Illustrated are two semiconductor cells 20 according to embodiments of the first aspect of the disclosed technology. They are combined together in an enlarged semiconductor unit 51. One semiconductor cell 20, implemented in SRAM technology, is illustrated in more detail at the left-hand side of
The word line WL and the bit lines BL, BLbar are provided for writing a value into the memory element 21. The memory element 21 has a first output port 211 and a second output port 212 for delivering the stored value and its complementary value, respectively.
The multiplication block 22, in the embodiment illustrated as an XNOR block, has an output port 225 for delivering the result of the multiplication operation carried out on the first operand, being the value stored in the memory element 21, and the second operand, being the input activation received by the semiconductor cell 20. The output ports 225 of the two semiconductor cells 20 together forming the enlarged semiconductor unit 51 are fed to a switch element 52.
The switch element 52 is such that the outputs 225 of the semiconductor cells 20 are each connected to a gate of a transistor T1, T2, the two transistors T1, T2 being coupled in parallel between ground and a read bitline RBL. A switch 53 is provided between the two transistors T1, T2 and the read bitline.
If the switch 53 is closed (e.g., if this switch is formed by a transistor), by bringing its gate, connected to a read word line RWL, to high, a combined output signal of the two semiconductor cells 20 can be read from the read bitline RBL. The read bitline is first charged to high (pre-charged). If the output of both semiconductor cells is low, neither of the transistors T1 and T2 goes into conduction, and the charge brought onto the read bitline RBL substantially remains there. When the sensing unit SU (e.g., a sense amplifier) senses the charge on the read bitline RBL, it senses a high value, and it determines therefrom that the outputs of both semiconductor cells 20 being read out are low. If the output of either one of the semiconductor cells 20 is high, the read bitline RBL is pulled to ground, and the charge previously stored there leaks away. If the outputs of both semiconductor cells 20 are high, the read bitline RBL is also pulled to ground and the previously stored charge leaks away; in this case, the discharge is even faster.
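Ignoring the discharge dynamics for a moment, the sensed bit in this scheme is simply the NOR of the two cell outputs, as the minimal behavioral sketch below illustrates (the names are assumptions made for the example):

```python
# Behavioral sketch of the pre-charged read bitline with two parallel
# pull-down transistors T1 and T2: the bitline keeps its pre-charge
# only if neither cell output is high, i.e. the sensed bit is the NOR
# of the two outputs. Discharge speed is ignored in this sketch.

def sensed_rbl(out1: int, out2: int) -> int:
    """1 = RBL still high at sense time, 0 = RBL pulled to ground."""
    return 0 if (out1 or out2) else 1

for out1 in (0, 1):
    for out2 in (0, 1):
        print(f"outputs ({out1}, {out2}) -> RBL reads {sensed_rbl(out1, out2)}")
```

Distinguishing the one-output-high case from the two-outputs-high case additionally requires exploiting the different discharge speeds, which motivates the modified switch element described further below.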
In an alternative embodiment, an array of enlarged semiconductor cells MEXN2_B may be used, illustrated in detail in the drawings.
It is an advantage of these embodiments of the disclosed technology with enlarged semiconductor units 51 that only one sense operation is required where previously, to read out the same information, two sense operations and a separate combination operation would have been required. Simultaneous reading can now be done on a single bitline. This means lower read energy is required, and the readout throughput is doubled.
However, with this switch element, the case in which only one of the cell outputs is high and the case in which both are high can only be distinguished by how fast the read bitline RBL discharges, which may be difficult to sense reliably.
This can be solved by implementing the switch element differently, as for instance illustrated in
The way of working is similar, in that the read bitline RBL is first charged high, e.g. pre-charged at the positive power supply voltage VDD. If neither of the semiconductor cells 20 has a high output, the charge remains on the read bitline RBL, and can be read out as such by the sensing unit SU (e.g., a sense amplifier). If either one of the semiconductor cells 20 has a high output, one of the transistors T1 or T2 goes into conduction, and only one of the transistors T3 or T4 goes into conduction. The charge leaks away from the read bitline RBL and this charge drop can be detected by the sensing unit SU; the charge does not, however, leak away over the series connection of transistors T3 and T4. If, on the other hand, both semiconductor cells 20 have a high output, all transistors T1, T2, T3 and T4 go into conduction, and charge leaks away from the read bitline RBL very fast. This faster or slower leaking away of the charge from the read bitline RBL can be detected by the sensing unit SU, which can in this way discriminate between the different situations.
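A highly simplified discharge model may clarify this discrimination; in the Python sketch below, each conducting pull-down path is assumed to contribute one unit of discharge current, and all numeric values (supply, capacitance, sample time) are illustrative assumptions rather than circuit data:

```python
# Simplified discharge model for the four-transistor switch element:
# each conducting pull-down path contributes one unit of discharge
# current; the series pair T3-T4 only conducts when BOTH outputs are
# high. Sampling the bitline at a fixed time then separates the cases.
# All numeric values are illustrative assumptions.

VDD, C, T_SAMPLE = 1.0, 1.0, 0.30

def rbl_voltage(out1: int, out2: int) -> float:
    parallel_paths = out1 + out2                # T1 and/or T2 conducting
    series_path = 1 if (out1 and out2) else 0   # T3-T4 series branch
    i_discharge = parallel_paths + series_path  # assumed unit currents
    return max(0.0, VDD - i_discharge * T_SAMPLE / C)

for case in [(0, 0), (0, 1), (1, 1)]:
    print(case, f"-> RBL at sample time: {rbl_voltage(*case):.2f} V")
# (0, 0) -> 1.00 V, (0, 1) -> 0.70 V, (1, 1) -> 0.10 V
```

Sampling the bitline at a fixed time thus yields three separable voltage levels, corresponding to zero, one or two high cell outputs, which is what enables the two-bit accumulation discussed above.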
This is illustrated in the simulation results shown in
The sense amplifier design is as illustrated in
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. The foregoing description details certain embodiments of the invention. It will be appreciated, however, that no matter how detailed the foregoing appears in text, the invention may be practiced in many ways. The invention is not limited to the disclosed embodiments. For example, the invention does not need to be implemented with SRAM memory elements, but can make use of any type of non-volatile memory.