The invention pertains to the physical implementation of a theoretical model of a neural network. The invention relates more particularly to the optimization of an implementation of a neural network. The invention is aimed at making it feasible to produce processors based on a theoretical model of a neural network.
A novel theoretical model of neural networks has been designed by C. BERROU and V. GRIPON. These novel networks, called GBNNs, are described especially in the documents [1] Vincent Gripon, Claude Berrou: "Sparse Neural Networks With Large Learning Diversity", [2] Vincent Gripon, Claude Berrou: "Nearly-optimal associative memories based on distributed constant weight codes", [3] X. Jiang, V. Gripon and C. Berrou: "Learning long sequences in binary neural networks", Proceedings of Cognitive 2012, [4] Vincent Gripon, Claude Berrou: "Dispositif d'apprentissage et de décodage de messages, mettant en œuvre un réseau de neurones" (Device for learning and decoding messages, implementing a neural network), patent application No. FR1056760, the disclosures of which are referred to in the present application. Other publications describe methods for creating such neural networks based on this theoretical model, especially [5] H. Jarollahi, N. Onizawa, V. Gripon and W. J. Gross: "Architecture and Implementation of an Associative Memory Using Sparse Clustered Networks".
Concretely, this model is based, on the one hand, on Hopfield networks (discrete-time recurrent neural networks for which the connection matrix is symmetric and zero on the diagonal) and, on the other hand, on LDPC (low-density parity check) type error correction codes. A GBNN network comprises a set of clusters, each comprising a set of neurons, and uses the notion of a neural "clique" for the learning and recognition of information. It has been demonstrated in [1] [2] [3] [4] that the associative memory created by a GBNN network shows appreciably better performance, for an equal number of neurons, than an associative memory based on the Hopfield model. In addition, this GBNN associative memory is also resistant to errors.
The discovery of this novel model of neural networks has been widely taken up by the scientific community.
First of all, we present the GBNN model and then the single currently existing proposal for implementing this model.
The authors of this network have used the technique of error corrector codes to overcome the drawbacks of the Hopfield networks. An error corrector code (turbo-code, LDPC) adds redundancy to messages in order to make them more robust against noise.
It must be noted that, just like the values of the neurons, the synaptic weights are Boolean values in the GBNN model, and this distinguishes it from most of the known networks (Perceptron, Hopfield and other networks), in which the synaptic weights are either integers or real numbers. More particularly, in GBNN, the synaptic weights are replaced by coefficients and the matrix of the synaptic weights is replaced by an adjacency matrix. However, although the operands are Boolean values, the computations made in GBNN are made in the domain of the integers, i.e. the partial results are integer values. In addition, the GBNN model relies on the notion of the cluster: a GBNN type neural network is thus formed by C clusters each containing L neurons (known as fanals) having the particular feature of not being mutually connected within a same cluster. The message (or pattern) to be learned is first of all divided into sub-messages (this is the principle of the blockwise codes): each sub-message, having a size of k bits, is associated with a thrifty code of size L=2^k. A thrifty code is a code containing all the binary words over {0; 1} of size n having only one bit valued 1.
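Purely by way of illustration, the following minimal Python sketch shows this blockwise splitting and thrifty (one-hot) coding; the parameters (C=4 clusters, k=2-bit sub-messages, hence L=2^k=4 neurons per cluster) and names are illustrative assumptions, not values taken from the cited documents.

```python
# Minimal sketch: splitting a message into C sub-messages of k bits each and
# encoding every sub-message as a thrifty (one-hot) code of length L = 2**k.
# Parameters and names are illustrative.

C, k = 4, 2          # number of clusters, bits per sub-message
L = 2 ** k           # neurons per cluster

def thrifty_encode(sub_message: int) -> list[int]:
    """Return the one-hot word of length L whose single 1 is at index sub_message."""
    return [1 if j == sub_message else 0 for j in range(L)]

def split_and_encode(message_bits: list[int]) -> list[list[int]]:
    """Split a C*k-bit message into C sub-messages and thrifty-encode each one."""
    assert len(message_bits) == C * k
    clusters = []
    for i in range(C):
        chunk = message_bits[i * k:(i + 1) * k]
        value = int("".join(str(b) for b in chunk), 2)   # the k-bit chunk as an integer
        clusters.append(thrifty_encode(value))
    return clusters

# Example: an 8-bit message mapped to 4 one-hot words of length 4.
print(split_and_encode([1, 0, 0, 1, 1, 1, 0, 0]))
# [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
```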
The following notations are used throughout the presentation of the prior art and the invention:
The main advantage of the GBNN network is that the network obtained has an efficiency appreciably greater than that of the Hopfield networks. The number of patterns that can be memorized is of the order of (C−1)×L²/[2C²×log₂(LC)], as against N/(2·log(N)) in a Hopfield network.
Besides, the capacity of the network (minimum quantity of binary information that can be learned by the network) passes from N²/(2·log(N)) in a Hopfield network to log₂(M+1) combinations of 2 among N. For example, a Hopfield network of 256 neurons enables the creation of an associative memory that will retain about 50 words of 256 bits (giving 400 words of 32 bits), whereas a GBNN formed by four clusters of 64 neurons each (giving 256 neurons), therefore working on 4×log₂(64) bits, could retain about 1200 words of 32 bits. The maximum quantity of binary information learned is therefore far greater than that of a Hopfield network.
The attractiveness of the GBNN model was immediately perceived by the scientific community. In the current state, however, only one physical implementation of this theoretical model has been described.
More particularly, this physical implementation has been described in the document “Architecture and Implementation of an Associative Memory Using Sparse Clustered Networks”: see [5] (this implementation is here below named V0).
The article [5] cited in reference proposes the first implementation on an FPGA of the GBNN (V0). The working of GBNN V0 is the following presented with reference to
The approach proposed in this document has many drawbacks.
For each neuron, the proposed architecture takes the integer sum of the products of the states of the (C−1)·L remote neurons multiplied by the (C−1)·L relative coefficients of adjacency. Obviously, this results in a large number of computations, which entails the use of a large number of computation units working in parallel to obtain high time-related performance. This conventionally results in a very large surface area and very high energy consumption. In addition, since the computations are made on integers, the propagation time of the signals in the arithmetic operators runs along the carry chains, correspondingly slowing down the operating frequency of the circuits (whether they are programmable processor type circuits or non-programmable dedicated type circuits).
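For reference, the per-neuron integer computation described above can be modelled as follows; this is a didactic sketch with illustrative names and data structures, not the FPGA description of reference [5].

```python
# Simplified model of the per-neuron computation in the V0 architecture:
# each neuron of cluster i accumulates an integer sum over the (C-1)*L remote
# neurons, weighted by the corresponding adjacency coefficients, and a
# winner-takes-all (WTA) then keeps the neuron(s) with the maximal score.
# Didactic sketch only; names are illustrative.

def v0_scores(i, C, L, state, w):
    """Integer score of every neuron of cluster i.
    state[(k, g)] in {0, 1}; w[((i, j), (k, g))] in {0, 1}."""
    scores = []
    for j in range(L):
        s = 0
        for k in range(C):
            if k == i:
                continue                      # neurons of a same cluster are not interconnected
            for g in range(L):
                s += w[((i, j), (k, g))] * state[(k, g)]
        scores.append(s)
    return scores

def v0_wta(scores):
    """Keep the neurons whose integer score is maximal."""
    m = max(scores)
    return [1 if s == m else 0 for s in scores]
```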
In addition, each neuron is permanently connected to the adjacency matrix and, at each iteration, it receives the values of the (C−1)·L neurons to which it can be connected. As a consequence, the architecture makes it necessary to connect each of the N neurons contained in the network to all their respective remote neurons (i.e. (C−1)×L). This results in a prohibitive number of connections, i.e. C×(C−1)×L² connections. The inventors have noted that certain communications are unnecessary: indeed, the inactive neurons have no influence on the results (the accumulation of zero results originating from partial products of the values of inactive neurons (hence zero values) multiplied by coefficients of the adjacency matrix). It is therefore not necessary to communicate their values. In addition, the communications could be serialized to reduce the number of connections.
In addition, the inventors have also noted that the solutions described in [5] propose memorizing the states of the connections in a square adjacency matrix. Thus, if two neurons ni,j and ni′,j′ are connected, the matrix will contain two coefficients: a first coefficient w((i,j)(i′,j′)) for the neuron ni,j and a second coefficient w((i′,j′)(i,j)) for the neuron ni′,j′. The values of the coefficients w((i,j)(i′,j′)) and w((i′,j′)(i,j)) are identical since these coefficients represent the state of one and only one connection between the neurons ni,j and ni′,j′. The proposed architecture therefore gives rise to a dual memorizing of the same piece of information. The size of the matrix of adjacency coefficients could therefore be divided by two, and the computations made during this phase could be halved, without modifying the general operation of the network.
Finally, in a GBNN network, each neuron carries out a set of computations from a set of information elements transmitted to it (i.e. the states of the remote neurons) and a set of information elements stored locally (the states of the connections stored in the local adjacency matrix). Since the states of the connections are also memorized in the remote clusters (see the above paragraph), the partial sums of the products could be made in the remote clusters. Consequently, only the corresponding results could be transmitted, reducing the number of connections needed between the neurons to the same extent. Moreover, a large number of unnecessary information elements are communicated and used in the computations. Indeed, certain operands and certain computations are unnecessary: e.g. the inactive neurons have no influence on the results (the accumulation of zero results originating from partial products of the values of inactive neurons (hence zero values) multiplied by coefficients of the adjacency matrix).
Thus, it is necessary to propose solutions that enable an architecture that can truly be scaled up and that is economically viable in relation to the storage capacity created.
More particularly, the problems posed by this V0 architecture are the following:
3. an excessively large number of connections between the neurons;
These problems make it impossible to obtain large-scale implementations. They also give rise to problems of cost, frequency and/or number of data access points and energy consumption.
The invention makes it possible to at least partially overcome the defects of the previously proposed architecture, and even of the basic theoretical model itself. Indeed, the invention relates to a method for obtaining a piece of data representing an identifier of one neuron from among a set of L neurons called a cluster, L being a natural integer of a value greater than or equal to two, said cluster belonging to a neural network comprising C clusters, C being a natural integer of a value greater than or equal to two, each neuron of said neural network comprising at least two states.
According to the invention, said method comprises, for at least one current cluster Ci:
a step for obtaining a set E of states of neurons originating from at least one cluster Cj, j≠i;
a step for obtaining a set A of coefficients of adjacency originating from at least one cluster Cj, j≠i;
a step for computing, as a function of said set E and of said set A, said piece of data representing an identifier of a winning neuron NG of said current cluster Ci.
Thus, the method of the invention makes it possible to obtain a winning neuron in a given cluster as a function of the states of the other neurons of the clusters which form part of the neural network. Depending on the embodiments, the obtaining of the sets of states is done differently.
According to one particular embodiment, said step for computing comprises, for a current neuron ni,j of said current cluster Ci, the application of the following formula:
υ(ni,j, t+1) = Λ_{k=1,k≠i}^{C} [ ( V_{g=1}^{L} ( w(i,j)(k,g) · υ(nk,g, t) ) ) V ¬( V_{g=1}^{L} υ(nk,g, t) ) ]
wherein:
υ(ni,j, t+1) is the state of the neuron ni,j at the instant t+1;
Λ_{k=1,k≠i}^{C}( . . . ) is a conjunction (logic AND) of C−1 non-zero binary elements given by the logic equations applied to all the other clusters of the neural network;
w(i,j)(k,g) is the coefficient of adjacency between the neuron nk,g and the neuron ni,j;
υ(nk,g, t) is the state of the neuron nk,g at the instant t;
V_{g=1}^{L}( . . . ) is the operation of disjunction (logic OR) of L binary elements representing the state (active or not active) at the instant t of the neurons of the remote clusters;
¬(V_{g=1}^{L} . . . ) is the operation of complemented disjunction (logic NOR) of L binary elements.
The logic equation makes it possible, for a remote cluster k that is active [in other words, if the remote cluster contains at least one active neuron nk,g], to determine whether one (logic OR) of these active neurons nk,g is connected (w(i,j)(k,g)) to the neuron ni,j at the instant t. If the remote cluster is not active, the complemented disjunction operation (logic NOR) of the L binary elements makes it possible not to take account of a piece of information which would then be irrelevant.
According to one particular embodiment, said step for computing comprises a step of selection, from among the neurons of said current cluster Ci, of the neuron NG, the state of which at the instant t+1 is 1.
According to one particular embodiment, said step for computing comprises a step of selection, from among the neurons of said current cluster Ci, of the neuron NG, the state of which at the instant t+2 is 1, in applying the following function:
According to one particular embodiment, said step for obtaining a set A of coefficients of adjacency originating from at least one cluster Cj, j≠i comprises a plurality of steps of access to at least one centralized structure for memorizing coefficients of adjacency of neurons of said clusters of said neural network.
According to one particular embodiment, said at least one centralized structure for memorizing coefficients of adjacency of neurons takes the form of a blockwise triangular matrix comprising a number of blocks equal to Σ_{i=1}^{C−1} i (i.e. C(C−1)/2), one block comprising L×L adjacency coefficients.
According to one particular embodiment, said step for obtaining a set E of states of neurons originating from at least one cluster Cj, j≠i comprises L steps of simultaneous transmission by each cluster Cj, j≠i, of a single state of a single neuron.
According to one particular embodiment, said step for obtaining a set E of states of neurons originating from at least one cluster Cj, j≠i comprises C steps of simultaneous transmission, one per cluster Cj, j≠i, of all the states of the cluster.
According to one particular embodiment, said step for obtaining a set E of states of neurons originating from at least one cluster Cj, j≠i comprises, within said current cluster Ci, a step for implementing a shift register of a predetermined size.
According to one particular embodiment, said predetermined size of said shift register is equal to the number of steps of simultaneous transmission.
According to one particular characteristic, said at least one cluster Cj implements at least one part of said step for computing and transmits to said cluster Ci the sum and/or the disjunction of the adjacency coefficients of its active neurons.
According to one particular implementation, the different steps of the methods according to the invention are implemented by one or more software programs or computer programs comprising software instructions intended for execution by a data processor of a relay module according to the invention and being designed to command the execution of the different steps of the methods.
Consequently, the invention is also aimed at providing a program, capable of being executed by a computer or by a data processor, this program comprising instructions to command the execution of the steps of a method as mentioned here above.
This program can use any programming language whatsoever and take the form of a source code, object code or an intermediate code between source code and object code, such as in a partially compiled form or in any other desirable form whatsoever.
The invention is also aimed at providing an information carrier readable by a data processor and comprising instructions of a program as mentioned here above.
The information carrier can be any entity or device whatsoever capable of storing the program. For example, the medium can comprise a storage means such as a ROM, for example a CD ROM or a microelectronic circuit ROM or again a magnetic recording means such as floppy disk or a hard disk drive.
Besides, the information carrier can be a transmissible carrier such as an electrical or optical signal, which can be conveyed via an electrical or optical cable, by radio or by other means. The program according to the invention can especially be uploaded to an Internet type network.
As an alternative, the information carrier can be an integrated circuit into which the program is incorporated, the circuit being adapted to executing or to being used in the execution of the method in question.
According to one embodiment, the invention is implemented by means of software and/or hardware components. In this respect, the term “module” in this document can correspond equally well to a software component as to a hardware component or to a set of hardware or software components.
A software component corresponds to one or more computer programs or several sub-programs of a program or more generally to any element of a program or a software package capable of implementing a function or a set of functions, according to what is described here below for the module concerned. Such a software component is executed by a data processor of a physical entity (terminal, server, gateway, router, etc) and is capable of accessing hardware resources of this physical entity (memories, recording media, communications buses, input/output electronic boards, user interfaces, etc).
In the same way, a hardware component corresponds to any element of a hardware assembly capable of implementing a function or a set of functions according to what is described here below for the module concerned. It may be a programmable hardware component or a component with an integrated processor for the execution of software, for example an integrated circuit, a smartcard, a memory card, an electronic card for executing firmware, etc.
The different embodiments mentioned here above can be combined with one another to implement the invention.
Other features and advantages of the invention shall appear more clearly from the following description of a preferred embodiment, given by way of a simple illustrative and non-exhaustive example, and from the appended figures, of which:
The general principle of the invention relies, in at least one embodiment, on improving the implementation of the GBNN model. More particularly, the solution for existing digital integrated circuits (such as the one proposed in "Architecture and Implementation of an Associative Memory Using Sparse Clustered Networks") is actually a simple transposition of the theoretical model to a digital hardware architecture, without identifying the limits of such a transposition. Now, for this transposition of the model to a digital integrated circuit to be economically and technologically viable in a scaling-up or large-scale implementation, it is necessary precisely to identify the limits of this transposition and to propose solutions to resolve the problems raised. Complementarily, the identification of the limits of the transposition also makes it possible to identify the limits of the theoretical model, as explained here below.
The inventors have succeeded in identifying these limits, and this is one of the constituent elements of the invention. Indeed, in addition to providing optimization, one of the difficulties lay in identifying the problems raised and the source of these problems. More particularly, the inventors have identified the fact that, to enable the large-scale implementation of the initially proposed architecture, it was necessary to modify the way in which the computations, the communications and the memorization were done, but also to modify the type of information exchanged. Each of the modifications made in the original architecture is independent of the other modifications and each of these modifications drastically reduces the costs of the initial architecture. Naturally, when all the optimizations are combined, the reduction is such that the architecture becomes technologically plausible and economically exploitable under current conditions.
More specifically, as presented in detail here below, the optimizations relate to:
These last two points reduce the number of connections on the circuit.
Thus, the invention makes it possible to divide by almost ten the number of logic units needed to make such a circuit. This reduction in surface area is accompanied by an improvement in time-related performance through an increase in the operating frequency of such a circuit, but also by a reduction in energy consumption through the reduction of the number of points of access to the memories and of the number of communications, and through the simplification of the computations.
The general principle of the invention can also be implemented by modifying or creating a logic code that can be executed by one or more processors (CPU, GPU, etc.). In this case, the improved architecture of the invention takes the form of an optimized executable code. The advantages drawn by this executable code are however the same or of the same nature as those of an implementation on a digital/physical circuit.
Each of these optimizations gives rise to one embodiment of the invention. The combination of these optimizations gives rise to a novel embodiment. Besides, the optimization studies carried out on this architecture have led the inventors to define an improved version of the initial model by replacing the data exchanged between the clusters by new data. This is therefore a fifth optimization of the initial architecture. According to the invention, in this fifth embodiment, called a “Super Neuron” embodiment, each cluster is seen as a single neuron in the binary state. As presented here below, in this fifth embodiment, a cluster, depending on the sub-message transmitted to it, plays the role alternately of one neuron after another of the L neurons that form it. In other words, in this fifth embodiment, the cluster is a reconfigurable neuron (i.e. the cluster plays the role of a single neuron at a given point in time).
In general, and from a certain point of view, the invention can be described as a method for obtaining at least one winning neuron NG. As compared with the prior art described here above, the winning neuron is obtained by applying at least one of the optimizations made. Thus, the invention, in at least one embodiment, relates to a method for obtaining a piece of data representing an identifier of a neuron from among a set of L neurons called a cluster, L being a natural integer with a value greater than or equal to two, said cluster belonging to a neural network comprising C clusters, C being a natural integer with a value greater than or equal to two, each neuron of said neural network comprising at least two possible states (for example lit/extinguished, on/off, 0, 1).
According to the invention, the method comprises, for at least one current cluster Ci, within an iterative process of updating the neurons and/or clusters of the neural network:
a step for obtaining a set E of states of neurons originating from at least one cluster Cj, j≠i;
a step for obtaining a set A of coefficients of adjacency originating from at least one cluster Cj, j≠i;
a step for computing, as a function of said set E and of said set A, said piece of data representing an identifier of a winning neuron NG of said current cluster Ci.
Each of the steps of this method relates to at least one of the optimizations presented here above. For example, the computation step is performed by carrying out a binarization of the computations for obtaining the winning neuron or neurons and/or for the WTA mechanism. The steps for obtaining sets of states and coefficients are performed either locally or in a centralized way as a function of the embodiments described here below. In the “Super Neuron” embodiment, each cluster sends the sum or disjunction (depending on whether it is integer computation or binary computation) of the adjacency coefficients of its active neurons instead of sending the state of the neurons. This means that the sets A and E described here above are really obtained but that at least one part of the computation is done at the cluster Cj, and the transmitted information is different from the information transmitted in the other embodiments.
Naturally, as described here below, it is possible to represent the invention directly in the form of a system integrating modules that can take charge of the optimizations described. In another form, the invention can also be seen as an electronic circuit that directly integrates electronic components to implement the invention. In this case, the components are laid out so that they can carry out the learning and recognition of messages by means of the hardware optimizations proposed (shift registers, multiplexers, physical transpositions of logic equations, etc.). The system can therefore be polymorphous, depending on the form of implementation of the method. The invention can also be implemented in the form of a microprocessor. The invention can also be implemented in the form of a device or special processor that is adjoined to a circuit or to an existing processor in order to enable the technique of the invention to be taken into account.
The idea of the basic matrix version is to optimize the logic of computation of the architecture: the basic version of the architecture (V0) proposes integer computations. The first optimization made (version V1) passes to binary computation.
Thus, in each cluster, the proposed architecture (version V1) replaces the integer sum of the products of the states of the (C−1)·L remote neurons multiplied by the (C−1)·L relative coefficients of adjacency with a result obtained by means of a logic equation handling Boolean variables. In addition, the proposed architecture (version V1) no longer requires, as is the case in the theoretical model, the performance of computations related to the WTA to take a "flexible" decision (i.e. in each cluster, to define which are the different active neurons if we consider that several neurons can be active in a cluster). Besides, in order to take a "hard" decision (i.e. define which is the only active neuron), which is not proposed in the initial theoretical model, the WTA is replaced by a simple logic equation handling Boolean variables. In other words, whatever the type of decision taken, a summing of integers and a comparison between the sums obtained are replaced by a logic equation, which consumes far less computation capacity.
By way of an illustration of the principles of this embodiment, we have chosen a pedagogical network model (three clusters of three neurons each) which makes it possible to understand the principle of this optimization. Referring to
In
The learning process (memorizing of coefficients in the adjacency matrix, i.e., in the GBNN model, the memorizing of the existence or non-existence of a connection between two neurons) is not described in detail in this embodiment because it is substantially identical to that of the version V0.
The validity, from the viewpoint of the initial model, of the passage from a piece of "integer" type data to a piece of "binary" type data is demonstrated as follows: the value of a neuron is computed according to the values of the other neurons and the connections between these neurons. The initial equation can be reduced to a simpler logic equation to be implemented: the idea of the inventors is to work henceforth only with logic operators and binary values (in the version V0, the value of a neuron is first of all an integer and then the cluster chooses that one of its neurons which has the greatest value; this is the WTA principle).
The passage from a computation of integers to a computation of Boolean values is done by observing that, in a cluster, at the end of the WTA, only one neuron has been chosen (when the situation relates to a case of “hard” decision making, see here below). Consequently, when sending its values, this cluster will transmit only one bit at 1 and all the other bits at 0.
When the operation is considered at the level of a cluster that receives information from remote clusters, in the version V0 each neuron must take the sum of its inputs weighted by the values of the coefficients of the adjacency matrix (either 0s or 1s). However, the inventors have found that it is not necessary to take an integer sum: it is enough to perform a simple logic AND operation on the values transmitted by all the remote clusters. If the Boolean result of this operation is 1, then it means that the current neuron is on the clique on which the remote neurons are situated. Indeed, this result, in terms of integer value, would give (C−1).
By way of an example, the following is the simplified equation for updating (instant t+1) the first neuron (a1) of the first cluster (A) in the case of a network of three clusters of three neurons each, where "·" denotes the logic AND, "+" the logic OR and "¬" the complement:
a1(t+1) = ((b1(t)·w_{a1b1} + b2(t)·w_{a1b2} + b3(t)·w_{a1b3}) + ¬(b1(t) + b2(t) + b3(t))) · ((c1(t)·w_{a1c1} + c2(t)·w_{a1c2} + c3(t)·w_{a1c3}) + ¬(c1(t) + c2(t) + c3(t)))
In other words, a neuron will be potentially active if it is connected to at least one of the active neurons in each of the clusters having active neurons (i.e. if it belongs to the current clique).
One condition for this equation to be valid is to overlook the extinguished clusters (i.e. clusters not containing active neurons): for example, if no neuron of the cluster B is active (i.e. if the sub-message of the cluster B is erased), then the logic AND operation that the neurons of the other clusters will perform should not be falsified by the values of B. By adding a NOR function to all the neurons of each cluster, we obtain the desired behavior: a cluster that contains no active neurons remains transparent and therefore does not interfere in the result. Concretely, the adjoining of the NOR to all the neurons maintains the property of the initial model according to which the recognition is possible even with a partial message.
The above equation can be generalized as follows for C clusters and N neurons, using the notations provided previously (and the general notations of Boolean algebra):
υ(ni,j, t+1) = Λ_{k=1,k≠i}^{C} [ ( V_{g=1}^{L} ( w(i,j)(k,g) · υ(nk,g, t) ) ) V ¬( V_{g=1}^{L} υ(nk,g, t) ) ]
The value υ(ni,j, t+1), which is not zero at the instant t+1, of a current neuron ni,j (i being the identifier of the cluster and j being the identifier of the neuron in the cluster i) is the result of a conjunction (logic AND, Λ_{k=1,k≠i}^{C}( . . . )) of non-zero results given by the logic equations applied to all the other clusters of the neural network. These logic equations make it possible, for each remote cluster k, to determine, when this cluster is active (logic NOR on the υ(nk,g, t) values) [in other words, if the remote cluster contains at least one active neuron nk,g], whether one (logic OR) of these active neurons nk,g is connected (w(i,j)(k,g)) to the neuron ni,j at the instant t.
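A minimal Python sketch of this binarized update follows; the data structures (a dictionary of Boolean states and a dictionary of adjacency coefficients) and the names are illustrative assumptions.

```python
# Binarized neuron update of version V1: for each remote cluster k, the neuron
# n(i,j) requires either an active remote neuron connected to it (OR over g of
# w AND state) or an entirely inactive cluster (the NOR term, which makes
# erased clusters transparent), and ANDs these conditions over all clusters k.
# Didactic sketch; names are illustrative.

def update_neuron(i, j, C, L, state, w):
    """Boolean state of neuron n(i,j) at t+1 from the states at t."""
    result = True
    for k in range(C):
        if k == i:
            continue
        cluster_active = any(state[(k, g)] for g in range(L))            # OR of the cluster
        connected = any(w[((i, j), (k, g))] and state[(k, g)]            # OR of w AND state
                        for g in range(L))
        result = result and (connected or not cluster_active)            # AND with the NOR fallback
    return int(result)
```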
In order to take a “flexible” decision (i.e. to define which are the different active neurons in each cluster when it is considered that several neurons can be active in a cluster), no additional computation is necessary relative to the previous one.
However, in order to take a “hard” decision (i.e. in order to define which is the unique active neuron when it is considered that a unique neuron can be active in a cluster), the WTA is replaced by the following logic equation:
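A minimal sketch of one possible realization of such a hard decision is given below as an illustrative assumption (a priority choice among the neurons left active by the binary update); it is not necessarily the logic equation used by the invention.

```python
# Illustrative assumption only: one simple way of keeping a single active neuron
# per cluster is a priority choice (priority encoder) among the neurons left
# active by the binary update. This is not necessarily the logic equation of
# the patented architecture.

def hard_decision(cluster_states):
    """Keep only the first active neuron of the cluster (priority choice)."""
    winner = [0] * len(cluster_states)
    for j, s in enumerate(cluster_states):
        if s:
            winner[j] = 1
            break
    return winner

print(hard_decision([0, 1, 1]))   # -> [0, 1, 0]
```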
In addition, unlike in the original model of the GBNN, the performance of all the computations in binary mode intrinsically enables the network not to produce false positives and not to provide erroneous solutions as described in reference [2].
In the GBNN version as described originally, a message can be represented by a non-oriented graph whose nodes are words of the message. The architects of the version V0 limited themselves to this implementation which is strictly identical to the model. The existence of the connections between the words, i.e. the existence of an edge between two nodes of the graph is therefore stored in a symmetric matrix. However, there is a redundancy of information in this symmetric matrix. Indeed, the existence or non-existence of an edge connecting two nodes x and y of the graph is memorized by means of two memorizing elements: a first memorizing element is associated with the node x and, by means of one bit, stores the existence or non-existence of an edge with the node y; a second memorizing element is associated with the node y and, by means of one bit, stores the existence or non-existence of an edge with the node x. Thus, in such a matrix, each row represents the connections between a neuron and all the other neurons of all the other clusters, and each cluster is represented by a group of rows (one for each of its neurons). This redundancy is not negligible.
Now the inventors have found that it is possible, without losing information, to halve the number of memorizing elements needed for the storage of the coefficients. The computation logic is not affected, since the number of neurons (vertices of the graph) is not reduced. This is illustrated with reference to
The inventors have had the idea of modifying this matrix to reduce the resources allocated to the memory. The optimization consists in storing the adjacency coefficient of two neurons at only one place. Thus, the invention passes from a square matrix to a blockwise triangular matrix. Naturally, the general architecture is modified. It is indeed necessary to draw two tracks (or make two access points) towards this single memory point to feed the operators of the two concerned neurons so that each of these two neurons can have access to the information. However, since these tracks already exist in the square matrix, this optimization does not result in any increase in the number of tracks or signals (or number of access points in the context of a software implementation). Thus, no increase is generated and this is an advantage.
Thus, while keeping the number of tracks or access points unchanged, the optimization made is important in terms of memory points. Indeed, the number of bits of the matrix is of the order of L²C², specifically L²·C·(C−1), in the version V1 (a matrix of LC×LC bits, from which the diagonal has been eliminated since the neurons of a same cluster cannot be connected to each other). Eliminating the lower triangle of the matrix makes it possible to arrive at 0.5·L²·C·(C−1), which gives substantial savings since these quantities represent the dominant orders of magnitude of the architecture V0. To enable access to the information stored in this triangular matrix, a modification is made in the direction of reading of certain data. Indeed, the pieces of data that were memorized at memory points eliminated by the triangularization of the matrix were previously read in rows. They must therefore henceforth be read in columns. More particularly, the horizontal access AccH to the adjacency coefficients α and β of
Besides, a major corollary effect is obtained unexpectedly: the learning resource, i.e. the logic that makes it possible to give the different coefficients of the matrix their value, is also halved: indeed, since the number of coefficients is divided by two, the logic that enables these coefficients to be written is reduced by an equivalent value.
Thus, as compared to the version V0, the inventors have been able to reduce not only the computation resources but also the memory and learning resources. About 50% of the surface area is gained by this optimization.
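As an illustration of this principle, the following sketch stores the coefficient of each pair of neurons only once, in a blockwise triangular structure, and serves both neurons of the pair from the same memory point (row access for one cluster, column access for the other); the class, index convention and names are illustrative assumptions, not the layout of the circuit.

```python
# Sketch of a blockwise triangular adjacency store: the coefficient of the pair
# (n(i,j), n(k,g)) is kept at a single place, in the block indexed by the
# unordered pair of clusters. One neuron of the pair reads "in rows" and the
# other "in columns", as described above. Illustrative model.

class TriangularAdjacency:
    def __init__(self, C, L):
        self.L = L
        # one L x L Boolean block per unordered pair of distinct clusters
        self.blocks = {(i, k): [[0] * L for _ in range(L)]
                       for i in range(C) for k in range(i + 1, C)}

    def _locate(self, i, j, k, g):
        # the pair is stored only once, in the block of the smaller cluster index
        if i < k:
            return self.blocks[(i, k)], j, g      # row access (AccH)
        return self.blocks[(k, i)], g, j          # column access (AccV)

    def set(self, i, j, k, g, value=1):
        block, r, c = self._locate(i, j, k, g)
        block[r][c] = value

    def get(self, i, j, k, g):
        block, r, c = self._locate(i, j, k, g)
        return block[r][c]

# The same memory point serves both neurons of the pair:
m = TriangularAdjacency(C=3, L=3)
m.set(0, 1, 2, 0)
assert m.get(0, 1, 2, 0) == m.get(2, 0, 0, 1) == 1
```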
In one embodiment of the invention, rather than drawing a track between each pair of remote neurons (to make them transmit information), a serialization is carried out on the transfer of data between clusters. Indeed, since the clusters exchange information during the rendering (and in certain models during the learning), the reduction of the connections between clusters requires the serialization of the data: if a cluster must transmit L values to all the other clusters, we initially have L²·C·(C−1) tracks in the network of interconnections between clusters (here below, the term RIC is also used to designate this "network of interconnections between clusters"). This serialization consists in implementing several steps to enable a transfer of the pieces of data one after the other.
In enabling a cluster to distribute a bit that it receives among all its neurons, the operation passes to L·C·(C−1) tracks, which is a first optimization. By serializing the transfers, the number of tracks of the interconnection network between clusters is reduced and the computation resources used are pooled since the neurons (or the clusters) use these resources in turn (i.e. each receives, in turn, the information pertaining to it). The inventors have identified two axes of serialization:
a serialization by neuron, in which, at each step, every cluster transmits the state of a single one of its neurons (an iteration then takes L steps);
a serialization by cluster, in which, at each step, a single cluster broadcasts the states of all its neurons (an iteration then takes C steps).
The serialization implies an increase in latency, respectively by a factor L or C. Serialization also implies the introduction of a sequencer in the architectures as well as selection and routing circuitry. The sequencer is an automaton (or finite-state machine) comprising L or C states, depending on whether the serialization is done by neurons or by clusters.
The selection and routing circuitry requires the use of a series of multiplexers (conventionally organized in arborescent form) enabling the selection if necessary of a cluster or a neuron. In addition, to drive these multiplexers, an addressing vector (obtained by transcoding of the state register of the sequencer) would be necessary and would increase the cost of such an approach to an equivalent degree. The excess cost of adding such multiplexers and of the associated logic is such (see
Here below, we present the two identified serialization axes which enable the basic architecture to be optimized.
In this embodiment of the invention, the updating of the network, i.e. an iteration of decoding, takes C steps (namely one step per cluster of the neural network). As illustrated in
The logic equation described in the version V1 is still applicable but, depending on the embodiments, the intermediate results are stored (for the time taken to receive the totality of the data); here is an example of the updating of the first neuron of the first cluster if there are C clusters in the network (there are C updating steps):
δ0: a1(δ0) ← 1
δ1: a1(δ1) ← a1(δ0) · ((b1·w_{a1b1} + … + bL·w_{a1bL}) + ¬(b1 + … + bL))
δ2: a1(δ2) ← a1(δ1) · ((c1·w_{a1c1} + … + cL·w_{a1cL}) + ¬(c1 + … + cL))
. . .
δC−1: a1(t+1) ← a1(δC−2) · ((z1·w_{a1z1} + … + zL·w_{a1zL}) + ¬(z1 + … + zL))
At the step δ0, the value of the neuron a1 at the instant t+δ0 is initialized at 1. Then, at the next step δ1, the result of the preceding step is accumulated in this same neuron with the result of the application of the equation of behavior of the neurons defined here above (see 5.2.1). Then, the new values of the neuron continue to be accumulated until the C clusters have been traversed, i.e. until the C steps that end an iteration have been completed (passing to the instant t+1).
One drawback of this serialization is that a cluster transmits data to itself (since the RIC forms L tracks, when a cluster transmits its information, it will also receive it). However, the inventors have overcome this drawback by including a mechanism of transparency: when a cluster transmits, then it does not receive the data transmitted. In the above equation, this amounts to placing the intermediate result at 1 during the first step (since it is then the first cluster that transmits the information).
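A minimal sketch of this serial-cluster accumulation, with the transparency mechanism modelled by simply skipping the step at which the local cluster itself transmits; names and data structures are illustrative assumptions.

```python
# Sketch of one serial-cluster iteration for a neuron of cluster i: the
# accumulator starts at 1 (step delta_0) and, at each following step, exactly
# one remote cluster k broadcasts its L neuron states; the neuron ANDs its
# accumulator with the per-cluster term of the binarized equation.
# The transparency mechanism is modelled by ignoring the local cluster's own
# broadcast. Didactic sketch with illustrative names.

def serial_cluster_update(i, j, C, L, state, w):
    acc = 1                                              # delta_0: initialization at 1
    for k in range(C):                                   # one step per cluster
        if k == i:
            continue                                     # transparency: own broadcast ignored
        cluster_active = any(state[(k, g)] for g in range(L))
        connected = any(w[((i, j), (k, g))] and state[(k, g)] for g in range(L))
        acc = acc and (connected or not cluster_active)  # accumulate the step result
    return int(acc)
```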
However, the inventors have noted that this implementation, although it is valuable from the logic point of view, is sub-optimal. Indeed, it has been observed, by putting figures on this architecture, that the gains relative to the non-serialized matrix version (V1.0) are smaller than expected: the reason for this is that a non-negligible part of the gain achieved in terms of computation resources (computation of the value of the neurons and WTA at the level of the cluster) is compensated for by the loss induced by the use of routing resources (each neuron must multiplex one weight among (C−1), thus leading to a routing logic of the order of L²·C·(C−1)). In other words, in this embodiment, the computation logic is shifted, but it is not reduced in the expected proportions.
However, the inventors have had the idea of an optimization that benefits all the serial versions of the architecture: the use of shift registers to store the memory points. This use is described here below.
In this embodiment, the updating of the network takes L steps, one step per neuron. At each step, each cluster transmits the value of one and only one of its neurons on the interconnection network (RIC). This value is retrieved by all the neurons of the other clusters. Thus, the RIC comprises C tracks, but each cluster receives only (C−1) tracks (a cluster does not speak to itself). The problem of transparency encountered in the version V1.1 does not arise. The principle is described with reference to
As in the case of the version V1.1, the operating equation is identical to that of the matrix version (V1), and the serialization makes it possible to pool the computation resources. However, the inventors have also noted that this implementation, although it is interesting from the logic viewpoint, is sub-optimal. Indeed, it has been observed, in assessing the costs of this architecture, that the gains as compared with the non-serialized matrix version (V1.0) are low: the reason for this is that, as above, a non-negligible part of the gain achieved in terms of computation resources (computation of the value of the neurons and WTA at the level of the cluster) is compensated for by a loss caused by the use of routing resources.
Let us assume that four adjacency coefficients have been stored in the adjacency matrix, and that it is sought to access them sequentially (since the architecture is serialized).
The address (a0, a1) represents the state of the sequencer which is a binary word encoded in One-Hot encoding (a word containing a maximum of only one bit at 1). It is noted that, to pick off the four weights (four flip-flops), it is necessary to implement a relatively heavy multiplexing logic (for L weights it is necessary to have (L−1) MUX2:1).
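For this small example, the multiplexer-based access can be sketched as follows; the address encoding is simplified for the illustration and the signal names are illustrative assumptions.

```python
# Illustrative sketch of the multiplexer-based access described above: picking
# one of L stored weights requires a tree of (L-1) two-input multiplexers,
# here driven by the index derived from the sequencer state. The address
# encoding (one-hot versus binary) is simplified for the illustration.

def mux2(sel, a, b):
    """2:1 multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a

def mux_tree(weights, address_bits):
    """Select weights[address] with a tree of 2:1 multiplexers (LSB first)."""
    level = list(weights)
    for bit in address_bits:
        level = [mux2(bit, level[n], level[n + 1])
                 for n in range(0, len(level), 2)]
    return level[0]

# Four weights (four flip-flops), three MUX2:1 (L - 1 = 3):
print(mux_tree([0, 1, 1, 0], address_bits=[1, 0]))   # weight at index 1 -> 1
```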
The inventors have found that it is possible to considerably reduce the resources by creating a flip-flop ring: in learning mode, this shift register is filled with the input (data In) presented and in rendering, it copies itself into itself. It suffices therefore, in order to carry out this routing, to have a 2-to-1 multiplexer driven by the learn bit (
The gain is considerable since the inventors have also been able to eliminate the entire routing logic and since the learning logic is reduced by one order of magnitude. This operation is valid for any type of serialization (by cluster, by neuron or both at a time).
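A behavioral sketch of such a flip-flop ring is given below (an illustrative Python model of the register and of the single 2-to-1 multiplexer driven by the learn bit; it is not a hardware description).

```python
# Behavioral sketch of the flip-flop ring: a shift register whose single 2:1
# multiplexer, driven by the "learn" bit, selects either the presented input
# (learning mode) or the register's own output (rendering mode, i.e. the ring
# copies itself into itself). The coefficient needed at each step is then
# always read at the same tap, which removes the multiplexing/routing logic.

class FlipFlopRing:
    def __init__(self, length):
        self.reg = [0] * length

    def clock(self, learn, data_in=0):
        """One clock cycle: shift, feeding back the output unless learning."""
        feedback = data_in if learn else self.reg[-1]    # the single MUX2:1
        self.reg = [feedback] + self.reg[:-1]

    def read(self):
        return self.reg[-1]                              # always the same tap

# Learn four coefficients, then read them back cyclically while rendering:
ring = FlipFlopRing(4)
for bit in [1, 0, 1, 1]:
    ring.clock(learn=True, data_in=bit)

outputs = []
for _ in range(4):
    outputs.append(ring.read())
    ring.clock(learn=False)
print(outputs)   # [1, 0, 1, 1] : the learned coefficients are recovered in order
```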
There is thus a drop by one order of magnitude: the resources needed for the routing and the learning pass to an order of L²C or LC², depending on whether it is the clusters or the neurons that are serialized. The barrier of L²C² is crossed at the level of the computation resources: thus, through the use of the flip-flop ring according to the invention, only the quantity of memory needed remains at an order of magnitude of L²C²; all the other resources are at a lower order of magnitude (lower by a factor L or C) and therefore become negligible relative to the memory in an overall assessment.
Thus, when the routing logic is replaced by this shift register, it is observed that the serial cluster versions and the serial neuron versions are equivalent in terms of surface area as compared with the previous architectures. This is because the drop in non-memory resources below the L²C² limit makes these resources negligible relative to the memory itself, and therefore the resources occupied by the memory form the essential part of the resources of the architecture. The inventors have therefore practically eliminated the resources needed for learning and computation.
6.2.4. Combinations of Optimizations in Memory and Communications
The optimizations relating to computation (i.e. binarization) can be combined without any modification with the optimizations targeting the memory (triangularization) or the serialization of the communications. By contrast, certain combinations of the optimizations targeting the memory and the optimizations targeting the communications require dedicated architectural solutions.
Let us take a neural network of three clusters of three neurons each.
a) Serial Cluster Communications and Parallel Processing:
At an instant t, a single cluster (transmitter) broadcasts the state of all its neurons in parallel and all the other clusters (receivers) receive this information. These clusters access an adjacency matrix in parallel to retrieve their coefficients and locally carry out their processing operations. This mode of communication combines the triangularization of the matrix and the use of the flip-flop ring without problems (see
b) Serial Cluster Communications and Serial Cluster Processing:
At the instant t, a single cluster broadcasts the state of all its neurons in parallel and all the other clusters (receivers) receive this information. These clusters then access the adjacency matrix in series (one after the other) to retrieve their coefficients and locally carry out their processing operations. This serial access to the matrix makes it possible to pool the computation resources between the clusters. If the matrix is optimized by triangularization, then specific routing resources need to be added. Indeed, in this case, the use of a single flip-flop ring as described here above cannot be applied as simply (as illustrated in
Thus, the rows and the columns are not accessed simultaneously. Two rings are needed: a first ring is used to carry out an access in rows and a second ring is used to carry out an access in columns, as illustrated in
c) Serial Neural Communication and Parallel Processing:
At an instant t, all the clusters broadcast the state of one and only one of their neurons in parallel to all the other clusters. The clusters therefore access their adjacency matrix in parallel to retrieve their coefficients and locally carry out their processing operations. However, in the context of triangularization, since two clusters will access their shared adjacency matrix at the same time, and since they should not traverse it in the same direction, the use of a flip-flop ring to serialize the transfers of information is not trivial. Indeed, in this case, this means that it is necessary to make the flip-flop ring work both on the rows (access required by the cluster Ci) and on the columns (access required by the cluster Cj) of the adjacency matrix. In fact, according to the invention, this amounts to implementing this flip-flop ring not on the rows and the columns of the adjacency matrix but on the diagonals (see
Let M_{Ci,Cj} be an adjacency matrix between two clusters Ci and Cj (or an adjacency matrix block) containing the adjacency coefficients w((k,g)(k′,g′)), which will be denoted w(i,j); the permutation done by the flip-flop rings is then written as follows:
w(i,j) = w((i+1) mod L, (j+1) mod L)
This permutation offers the immense advantage of enabling the clusters to obtain read access to their coefficients, always in the same memory compartment. It is therefore not required to add additional routing resources and the size and/or number of access points are therefore advantageously reduced. Besides, it is also possible to have only one flip-flop ring whatever the type of access, in rows or in columns, and it is therefore not necessary to duplicate the flip-flop rings.
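The effect of this diagonal permutation can be checked on a small block; the following is a didactic model with illustrative values (in hardware, the rotation is performed by the flip-flop ring laid along the diagonals).

```python
# Sketch of the diagonal permutation w(i, j) <- w((i+1) mod L, (j+1) mod L)
# applied to one adjacency block of size L x L. After t rotations, physical
# column 0 holds the coefficients attached to remote neuron t (row access for
# one cluster) and physical row 0 holds the coefficients attached to local
# neuron t (column access for the other cluster): each cluster therefore always
# reads the same compartments, without additional routing resources.

L = 4
w = [[10 * i + j for j in range(L)] for i in range(L)]   # block M(Ci,Cj); value 10*i+j stands for w(i,j)

def rotate_diagonal(block):
    n = len(block)
    return [[block[(i + 1) % n][(j + 1) % n] for j in range(n)] for i in range(n)]

block = [row[:] for row in w]
for t in range(L):
    col0 = [block[i][0] for i in range(L)]     # what the row-accessing cluster reads
    row0 = block[0]                            # what the column-accessing cluster reads
    assert sorted(col0) == sorted(w[i][t] for i in range(L))   # original column t
    assert sorted(row0) == sorted(w[t][j] for j in range(L))   # original row t
    block = rotate_diagonal(block)
```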
d) Serial Neural Communications and Serial Cluster Processing:
At the instant t, all the clusters broadcast the state of one and only one of their neurons in parallel to all the other clusters. These clusters access an adjacency matrix in series to retrieve their coefficients and carry out their processing operations locally. In the context of triangularization, two clusters never access their shared adjacency matrix at the same time. Hence, even if they do not traverse it in the same direction, there is no particular conflict to be managed, provided that two flip-flop rings are used (permutations on the rows for the first matrix and then permutations on the columns for the second matrix). It will be noted that it is also possible to use flip-flop rings diagonally to reduce the cost of this architectural solution.
Any other combination pertaining to the modes of communication or to the sharing of computations is done via all or part of the four major embodiments proposed here above. They all rely on the simple use of flip-flop rings, whether combined or used diagonally.
6.2.5. Architecture of the Super Neuron Version (Version V2)
In this embodiment, the inventors have modified the interpretation of the theoretical GBNN model. The principle of this embodiment is described with reference to
Unlike in the classic GBNN neural networks, the information transmitted by the remote clusters is no longer the values of their neurons υ(nk,g, t) but the coefficients of their adjacency matrix w(i,j)(k,g) in the generalized equation (part 5.2.1). Indeed, the states of the connections are stored in the adjacency memory of the remote cluster and in the adjacency memory of the local cluster. Thus, when a cluster receives a sequence of values w(i,j)(k,g) from a remote cluster, it must interpret it as a sequence of binary values indicating whether the neurons of the remote cluster or clusters are or are not connected with its local neurons. In fact, all that the local cluster now has to do is to carry out the computation to find out which local neurons are active (i.e. the WTA).
In order to provide different levels of optimization, the super neuron version can be divided into two modes of operation in the GBNN model:
In the hard decision model, the pieces of transmitted information are the values of the coefficients of the adjacency matrix for the single active neuron (if not, no value is transmitted or zero values are transmitted).
In the flexible decision model, the information received by the local clusters is the sum or the disjunction (depending on whether the computation is an integer computation or a binary computation), partial or not (depending on the embodiment), of the products or conjunctions between the value of an active remote neuron and its coefficient of the corresponding adjacency matrix. Indeed, these computations can be made for each local neuron in the remote clusters since the states of the connections are memorized in the adjacency memory of the remote cluster and in the adjacency memory of the local cluster. Thus, the remote cluster does not transmit the values w(i,j)(k,g) of an active neuron but the accumulation of the values w(i,j)(k,g) of all its active neurons. Consequently, only the partial result of a local neuron computed in a remote cluster is transmitted to said neuron.
Let us take the example given in part 5.2.1 and consider that the sub-message of the cluster B is erased and that the WTA mechanism of the cluster B has determined that the neurons b1 and b2 are active. In the flexible decision model, the remote cluster B then transmits to the local cluster A the results of the following partial sums: (w_{a1b1} + w_{a1b2}) for the neuron a1, (w_{a2b1} + w_{a2b2}) for the neuron a2 and (w_{a3b1} + w_{a3b2}) for the neuron a3.
The WTA can then be carried out whatever the mode of decision by the local cluster by using these received results of partial remote computations to determine which are the active local neurons.
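A minimal sketch of this exchange in the flexible decision mode with binary computation: the remote cluster computes the partial disjunctions for each local neuron and the local cluster merely combines them; the function names and data structures are illustrative assumptions.

```python
# Sketch of the "Super Neuron" exchange in flexible decision mode with binary
# computation: a remote cluster k sends, for each neuron j of the local cluster
# i, the disjunction of the adjacency coefficients w((i,j)(k,g)) of its own
# active neurons, plus a flag saying whether it has any active neuron at all.
# The local cluster then only combines the received partial results (erased
# clusters remain transparent). Illustrative model.

def remote_partial(i, k, L, state, w):
    """Computed in remote cluster k: one Boolean per neuron of the local cluster i."""
    active = [g for g in range(L) if state[(k, g)]]
    partial = [any(w[((i, j), (k, g))] for g in active) for j in range(L)]
    return partial, bool(active)

def local_combine(i, C, L, partials):
    """Computed in local cluster i from the (partial, active) pairs of the remote clusters."""
    new_states = []
    for j in range(L):
        ok = all(partial[j] or not active            # transparency of erased clusters
                 for k, (partial, active) in partials.items() if k != i)
        new_states.append(int(ok))
    return new_states
```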
Thus, in the case of a Super Neuron, we are no longer in the presence of neurons that exchange their states but in the presence of clusters that transmit a piece of information to the neurons informing them that they are connected or not connected to their active neurons.
Naturally, the optimizations described here above in this document can be combined with this original mode of operation. Thus, the inventors have also taken advantage of the possibility of serializing the transfers of information between clusters to simplify the architecture of computation of the WTA. Indeed, since the local cluster receives coefficients of adjacency of the active neurons of the other clusters, the transfer of this information in series to each of its neurons (serialization) replaces the simultaneous computations of the scores of all the neurons by a computation on the fly of the score of each local neuron concerned. This modification thus enables a drastic reduction of the computation resources needed for the WTA.
Although the present disclosure has been described with reference to one or more examples, workers skilled in the art will recognize that changes may be made in form and detail without departing from the scope of the disclosure and/or the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
1261155 | Nov 2012 | FR | national |
This Application is a Section 371 National Stage Application of International Application No. PCT/EP2013/074518, filed Nov. 22, 2013, which is incorporated by reference in its entirety and published as WO 2014/079990 on May 30, 2014, not in English.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2013/074518 | 11/22/2013 | WO | 00 |