Learning device and method

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to parallel processing, and, more particularly, to learning methods in devices such as neural networks with hidden units.
2. Description of the Related Art
Attempts to understand the functioning of the human brain have led to various "neural network" models in which large numbers of neurons are interconnected, with the inputs to one neuron being essentially the outputs of many other neurons. These models roughly presume that each neuron exists in one of two states (quiescent and firing), with the neuron's state determined by the states of the connected input neurons (if enough connected input neurons are firing, then the original neuron should switch to the firing state). Some models provide for feedback so that the output of a neuron may affect its input, and in other models outputs are only fed forward.
In a feedforward neural network, one set of neurons is considered input units, another set of neurons is considered output units, and, optionally, other neurons are considered hidden units. In such a situation, input patterns would stimulate input units which in turn would stimulate various layers of hidden units which then would stimulate output units to form an output. The aspect of learning in the human brain can be mimicked in neural networks by adjusting the strength of the connections between neurons, and this has led to various learning methods. For neural networks with hidden units there have been three basic approaches: (1) competitive learning with unsupervised learning rules employed so that useful hidden units develop, although there is no external force to ensure that appropriate hidden units develop; (2) prescription of the hidden unit structure on some a priori grounds; and (3) development of learning procedures capable of leading to hidden unit structure adequate for the problem considered. See generally D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pp. 318-362 (MIT Press, 1986), which describes a backward pass learning method called the "generalized delta rule". The generalized delta rule basically has a neural network learn the correct outputs to a set of inputs by comparing actual outputs with correct outputs and modifying the connection strengths by a steepest descent method based on the differences between actual outputs and correct outputs.
However, the generalized delta rule has the problem that a standard implementation with electronic devices does not directly lead to a compact architecture. Rather, the generalized delta rule leads to an architecture that focuses on the forward pass much more than the backward pass, and it is not clear how the same structural units would be used for both the forward pass computations and the backward pass computations. Indeed, the generalized delta rule is most often viewed as implementable by EEPROMs.
SUMMARY OF THE INVENTION
The present invention provides backward pass learning methods for feedforward neural network type devices and network architectures for these methods that use only local calculations and communications. Preferred embodiment architectures include nearest neighbor processor arrays that are expandable simply by addition of more processors to the arrays. Common processing for the forward and backward pass computations permits simple architecture, and various device types may be used for the processors such as general purpose processors, discrete digital processors, analog processors, and quantum effect devices.

BRIEF DESCRIPTION OF THE DRAWINGS
The drawings are schematic for clarity.
FIG. 1 illustrates a basic unit of computation for a forward pass;
FIG. 2 shows the forward pass layers in a learning network;
FIGS. 3a and b illustrate semilinear functions;
FIG. 4 illustrates a basic building block;
FIGS. 5a and b show a two-dimensional array for calculation;
FIG. 6 shows a learning network constructed from the two-dimensional array of FIG. 5;
FIG. 7 shows a unit for error calculation;
FIGS. 8a-c illustrate flow graphs for the computations;
FIG. 9 shows a realization of the block of FIG. 4; and
FIG. 10 illustrates reduction in number of computation elements for a realization.

DESCRIPTION OF THE PREFERRED EMBODIMENTS
The generalized delta-rule (GDR) is a learning rule that can be applied to multilayer decision making neural networks. The network form is really quite simple. The network is composed of sequential layers. The first layer receives an input, processes this input in a predefined way and outputs it to the second layer. The process of a layer receiving an input, processing it, and passing it on to the next layer continues until the last layer, which produces the output of the network. Each layer is composed of multiple neurons (units) which we refer to as forward-pass computational units. It is in these units that the processing is performed as data passes forward through the network. FIG. 1 illustrates the basic structure of a forward pass computational unit (within the broken lines), and FIG. 2 illustrates the layered nature of the network.
We will find it very important to precisely define a notation for describing these networks. This notation is key to successfully manipulating the relationships between layers, units, and the computations performed by the units. Referring to FIG. 2, unit i in layer k is referred to as .sub.k u.sub.i. Similarly, .sub.k o.sub.j (n) is the output of unit j in layer k at "time" n, and .sub.k w.sub.ij (n) is the weight of the connection from unit j in layer k-1 (unit .sub.k-1 u.sub.j) to unit i in layer k (unit .sub.k u.sub.i) at time n.
The "output" .sub.-1 o.sub.j (n) is the input to the system. The number of units (outputs) in layer k is denoted as N.sub.k, and the units are labelled 0, 1, 2, . . . , N.sub.k -1. The maximum number of inputs per cell in layer k is given by the number of units in the previous layer k-1 which is N.sub.k-1. Throughout this analysis we will assume that there are K forward-pass layers in the network which are labelled 0, 1, 2, . . . , K-1. The forward pass equations which are used to derive the response of the network to an input are: ##EQU1##
In equation (1) the function .sub.k .function..sub.i (.) is a semilinear function. A semilinear function (examples illustrated in FIGS. 3a and b) is nondecreasing and differentiable. Later we will find it useful to make this function not simply nondecreasing but also a strictly monotonically increasing function.
Implicit in the new notation used in equation (1) is the propagation inherent in the networks being analyzed since we note that .sub.k o.sub.i (n) cannot be calculated until we have determined .sub.k-1 o.sub.j (n). In other words, the output of layer k cannot be determined until the output of layer k-1 has been determined. Our propagation begins with layer 0 and .sub.-1 o.sub.j (n) which are the inputs we assume we are given.
The definition of ${}_k\mathrm{net}_i(n)$ is also important:
$$ {}_k\mathrm{net}_i(n) = \sum_{j=0}^{N_{k-1}-1} {}_k w_{ij}(n)\,{}_{k-1}o_j(n) \tag{3} $$
We can consider ${}_k\mathrm{net}_i(n)$ to be the strictly-linear portion of the response of a unit. This term will be of great importance to us later.
We will often times find it very useful to express equations (1), (3), and other results we derive in vector form. So, let us define several very useful vector quantities. First, let us organize the weights for a particular layer in a matrix form. We will refer to this matrix of weights for a particular layer k as .sub.k W(n) and define it to be: ##EQU3##
Thus we see that .sub.k W(n) is a N.sub.k .times.N.sub.k-1 matrix. Notice that row i of .sub.k W(n) is the set of weights associated with unit .sub.k u.sub.i at time n.
The output vector .sub.k o(n) is an N.sub.k element vector composed of the outputs of all of the units .sub.k u.sub.j at time n: ##EQU4##
The vector .sub.k net(n) is an N.sub.k element vector composed of the .sub.k net.sub.j (n) terms of all of the units .sub.k u.sub.j at time n: ##EQU5##
We will also need a vector function operation .sub.k f() which we define as: ##EQU6## As defined here, applying .sub.k f() to a vector uses the function .sub.k .function..sub.j () to operate on the j.sup.th element of a vector and yields the results as a vector.
Using this vector notation, equation (1) becomes: ##EQU7## and equation (3) becomes
.sub.k net(n)=.sub.k W(n).sub.k-1 o(n) (10)
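The vector forward pass can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logistic sigmoid is used for every unit, weights are plain nested lists, and the names `layer_forward` and `forward` are hypothetical helpers, not part of the described device.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, used here as the assumed semilinear function."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(W, o_prev, f=sigmoid):
    """One layer: net = W o_prev (eq. 10), then o = f(net) (eq. 9)."""
    net = [sum(w_ij * o_j for w_ij, o_j in zip(row, o_prev)) for row in W]
    return net, [f(x) for x in net]

def forward(weights, x, f=sigmoid):
    """Propagate the input (the layer -1 "output") through layers 0..K-1."""
    outputs = [list(x)]
    for W in weights:
        _, o = layer_forward(W, outputs[-1], f)
        outputs.append(o)
    return outputs

weights = [
    [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]],  # layer 0: 3 units, 2 inputs
    [[1.0, -1.0, 0.5]],                     # layer 1: 1 unit, 3 inputs
]
outs = forward(weights, [1.0, 0.0])
```

Each layer's output is computed only after the previous layer's output is known, which mirrors the propagation order noted for equation (1).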
A learning network does not only produce output signals. It must also learn from its mistakes. The algorithm for learning from mistakes is derived next. Let us introduce a measure of the error $E({}_k w_{ij}(n))$ which is some, as yet undefined, measure of the error of the network due to its response ${}_{K-1}o(n)$ after receiving the input ${}_{-1}o(n)$. We also assume that the only means for controlling this error is by the manipulation of the weights ${}_k w_{ij}(n)$. Furthermore, let us assume that we will use a method of steepest descent in order to update the weights in a network based upon the error. Thus, our general approach is to update the weights according to the following relationship:
$$ {}_k w_{ij}(n+1) = {}_k w_{ij}(n) - \eta\,\frac{\partial E}{\partial\,{}_k w_{ij}(n)} \tag{11} $$
Of course, $\partial E / \partial\,{}_k w_{ij}(n)$ is the component of the gradient of the error term with respect to the network weight ${}_k w_{ij}(n)$, and $\eta$ is the learning rate. If the learning rate is taken to be small, then many passes through the network will be required; whereas, if the learning rate is taken to be large, then oscillations may occur. We will find it convenient to treat the gradient as a matrix $\nabla_k E(n)$. For a particular layer k we define this as an $N_k \times N_{k-1}$ matrix (the same dimensions as ${}_k W(n)$) defined as follows:
$$ \nabla_k E(n) = \begin{bmatrix} \dfrac{\partial E}{\partial\,{}_k w_{00}(n)} & \cdots & \dfrac{\partial E}{\partial\,{}_k w_{0(N_{k-1}-1)}(n)} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial E}{\partial\,{}_k w_{(N_k-1)0}(n)} & \cdots & \dfrac{\partial E}{\partial\,{}_k w_{(N_k-1)(N_{k-1}-1)}(n)} \end{bmatrix} \tag{12} $$
Similar to the meaning of .sub.k W(n), each row i of the matrix .gradient..sub.k E(n) represents the change in E(.sub.k w.sub.ij (n)) as a function of the weights associated with unit .sub.k u.sub.i. This then leads to a matrix form of the update equation (11):
.sub.k W(n+1)=.sub.k W(n)-.eta..gradient..sub.k E(n) (13)
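The influence of the learning rate on a steepest-descent update of this form can be illustrated with a toy example. The sketch below assumes a single weight and the error E(w) = w², chosen only for illustration; `descend` is a hypothetical helper name.

```python
def descend(eta, w0=1.0, steps=5):
    """Iterate w <- w - eta * dE/dw for the toy error E(w) = w**2."""
    w, path = w0, [w0]
    for _ in range(steps):
        grad = 2.0 * w       # dE/dw for E(w) = w**2
        w = w - eta * grad   # scalar form of the steepest-descent update
        path.append(w)
    return path

slow = descend(0.05)  # small rate: steady but slow decrease toward 0
fast = descend(0.9)   # large rate: the weight changes sign on every step
```

With the small rate each step multiplies the weight by 0.9, so many steps are needed; with the large rate each step multiplies it by -0.8, so the weight overshoots the minimum and oscillates around it.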
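Our problem is how to efficiently calculate the gradient matrix $\nabla_k E(n)$. To do this we must first have a better understanding of $\partial E / \partial\,{}_k w_{ij}(n)$. We note from the chain rule that:
$$ \frac{\partial E}{\partial\,{}_k w_{ij}(n)} = \frac{\partial E}{\partial\,{}_k\mathrm{net}_i(n)}\,\frac{\partial\,{}_k\mathrm{net}_i(n)}{\partial\,{}_k w_{ij}(n)} \tag{14} $$
Now let us consider the two terms on the right side of equation (14). From the definition of ${}_k\mathrm{net}_i(n)$ we see that:
$$ \frac{\partial\,{}_k\mathrm{net}_i(n)}{\partial\,{}_k w_{ij}(n)} = \frac{\partial}{\partial\,{}_k w_{ij}(n)} \sum_{m=0}^{N_{k-1}-1} {}_k w_{im}(n)\,{}_{k-1}o_m(n) = {}_{k-1}o_j(n) $$
Now we introduce a new definition:
$$ {}_k\delta_i(n) = \frac{\partial E}{\partial\,{}_k\mathrm{net}_i(n)} $$
The vector ${}_k\delta(n)$ is an $N_k$ element vector composed of the ${}_k\delta_i(n)$ terms of all of the units ${}_k u_i$ in layer k at time n:
$$ {}_k\delta(n) = \left[\,{}_k\delta_0(n),\ {}_k\delta_1(n),\ \ldots,\ {}_k\delta_{N_k-1}(n)\,\right]^T \tag{18} $$
Using the definition of ${}_k\delta_i(n)$ we can write:
$$ \frac{\partial E}{\partial\,{}_k w_{ij}(n)} = {}_k\delta_i(n)\,{}_{k-1}o_j(n) \tag{19} $$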
Using equation (19) and the definitions of .gradient..sub.k E(n), .sub.k .delta.(n), and .sub.k o(n) we can say:
.gradient..sub.k E(n)=.sub.k .delta.(n).sub.k-1 o.sup.T (n)(20)
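As a small worked sketch of equations (20) and (13), the gradient matrix is the outer product of the layer's error vector with the previous layer's output vector, and the update subtracts a scaled copy of it from the weights. The helper names `outer` and `update` and the numeric values are illustrative assumptions.

```python
def outer(delta, o_prev):
    """Gradient matrix of eq. (20): entry (i, j) is delta_i * o_prev_j."""
    return [[d_i * o_j for o_j in o_prev] for d_i in delta]

def update(W, grad, eta):
    """Matrix update of eq. (13): W(n+1) = W(n) - eta * grad."""
    return [[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]

delta = [0.5, -1.0]          # N_k = 2 error terms for layer k
o_prev = [1.0, 2.0, 3.0]     # N_{k-1} = 3 outputs of layer k-1
grad = outer(delta, o_prev)  # 2 x 3 matrix, same shape as the weights
W_next = update([[0.0] * 3, [0.0] * 3], grad, eta=0.1)
```

Row i of `grad` depends only on the error term of unit i and on the inputs to the layer, which is what makes the per-cell update possible.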
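Now let us look at the ${}_k\delta_j(n)$ term in a little more detail. Again, using the chain rule, we write:
$$ {}_k\delta_j(n) = \frac{\partial E}{\partial\,{}_k\mathrm{net}_j(n)} = \frac{\partial E}{\partial\,{}_k o_j(n)}\,\frac{\partial\,{}_k o_j(n)}{\partial\,{}_k\mathrm{net}_j(n)} \tag{21} $$
The term $\partial E / \partial\,{}_k o_j(n)$ is a measure of the dependency of the error on the output of unit j in layer k. The term $\partial\,{}_k o_j(n) / \partial\,{}_k\mathrm{net}_j(n)$ is a measure of the change of the output due to the `net` weighted inputs.
Since
$$ {}_k o_j(n) = {}_k f_j({}_k\mathrm{net}_j(n)) \tag{22} $$
then
$$ \frac{\partial\,{}_k o_j(n)}{\partial\,{}_k\mathrm{net}_j(n)} = {}_k f_j'({}_k\mathrm{net}_j(n)) \tag{23} $$
Recall that previously we assumed ${}_k f_j()$ was differentiable, so we know that the derivative in (23) exists.
Therefore:
$$ {}_k\delta_j(n) = \frac{\partial E}{\partial\,{}_k o_j(n)}\,{}_k f_j'({}_k\mathrm{net}_j(n)) \tag{24} $$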
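We will now see how to calculate any ${}_{k-1}\delta_j(n)$ given the ${}_k\delta_i(n)$. For a particular layer k-1 we seek $\partial E / \partial\,{}_{k-1}o_j(n)$. Using the chain rule we write:
$$ \frac{\partial E}{\partial\,{}_{k-1}o_j(n)} = \sum_{i=0}^{N_k-1} \frac{\partial E}{\partial\,{}_k\mathrm{net}_i(n)}\,\frac{\partial\,{}_k\mathrm{net}_i(n)}{\partial\,{}_{k-1}o_j(n)} \tag{25} $$
$$ \frac{\partial E}{\partial\,{}_{k-1}o_j(n)} = \sum_{i=0}^{N_k-1} {}_k\delta_i(n)\,{}_k w_{ij}(n) \tag{26} $$
This sum is a very important term which we, much like in the definition of ${}_k\mathrm{net}_i(n)$, will define as ${}_{k-1}\mathrm{delta}_j(n)$:
$$ {}_{k-1}\mathrm{delta}_j(n) = \sum_{i=0}^{N_k-1} {}_k w_{ij}(n)\,{}_k\delta_i(n) \tag{27} $$
Writing ${}_{k-1}\mathrm{delta}_j(n)$ in matrix form we get: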
.sub.k-1 delta(n)=.sub.k W.sup.T (n).sub.k .delta.(n) (28)
Returning to (24), and using the result of (26), we can now write: ##EQU25##
Equation (31) is a very important equation. Inherent in this equation is the flow of the calculation necessary to determine .sub.k-1 .delta..sub.j (n). It explicitly shows that .sub.k-1 .delta..sub.j (n) may not be calculated until .sub.k .delta..sub.i (n) has been determined. This structure is very similar to the structure found in equation (1) which showed that the .sub.k o.sub.i (n) may not be calculated until .sub.k-1 o.sub.j (n) is determined.
In matrix notation equation (31) becomes ##EQU26## In equation (33) we have used diag(.sub.k-1 f'(.sub.k-1 net(n))) to represent the matrix which is all zero except for the diagonal which is composed of the elements of .sub.k-1 f'(.sub.k-1 net(n)).
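For the final layer k=K-1 we calculate, albeit in an unspecified manner, $\partial E / \partial\,{}_{K-1}o_j(n)$. Then we see that, for the last layer
$$ {}_{K-1}\delta_j(n) = {}_{K-1}f_j'({}_{K-1}\mathrm{net}_j(n))\,\frac{\partial E}{\partial\,{}_{K-1}o_j(n)} \tag{34} $$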
Using the results of equation (32), given equation (34), we can calculate the error term for all of the layers in a backward pass through the network.
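The backward pass can be checked numerically. The following sketch assumes the logistic sigmoid for every unit (so that the derivative equals o(1 - o)) and, purely as an example error criterion, the square error E = ½ Σ (o_j - d_j)²; neither choice is required by the derivation. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x):
    """Return [input, o_0, ..., o_{K-1}] for a list of weight matrices."""
    outs = [list(x)]
    for W in weights:
        outs.append([sigmoid(sum(w * o for w, o in zip(row, outs[-1])))
                     for row in W])
    return outs

def error(weights, x, d):
    """Assumed error measure: E = 1/2 * sum_j (o_j - d_j)**2."""
    o = forward(weights, x)[-1]
    return 0.5 * sum((oj - dj) ** 2 for oj, dj in zip(o, d))

def gradients(weights, x, d):
    """Backward pass: one gradient matrix per layer, as in eq. (20)."""
    outs = forward(weights, x)
    # Last layer (eq. 34): dE/do_j * f'(net_j), with f' = o(1 - o) for the sigmoid.
    delta = [(o - t) * o * (1.0 - o) for o, t in zip(outs[-1], d)]
    grads = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        o_prev = outs[k]  # outputs of layer k-1 (outs[0] is the network input)
        grads[k] = [[d_i * o_j for o_j in o_prev] for d_i in delta]
        if k > 0:
            W = weights[k]
            # eq. (28): back-propagated sums, W^T times the layer-k error vector
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(o_prev))]
            # eq. (31): multiply by f'(net) of layer k-1
            delta = [o * (1.0 - o) * b for o, b in zip(o_prev, back)]
    return grads

def numeric_partial(weights, x, d, k, i, j, eps=1e-6):
    """Central-difference estimate of dE/dw_ij in layer k (for checking only)."""
    saved = weights[k][i][j]
    weights[k][i][j] = saved + eps
    e_plus = error(weights, x, d)
    weights[k][i][j] = saved - eps
    e_minus = error(weights, x, d)
    weights[k][i][j] = saved
    return (e_plus - e_minus) / (2.0 * eps)

weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
x, d = [1.0, 0.5], [1.0]
grads = gradients(weights, x, d)
```

Comparing an entry such as `grads[0][1][0]` with `numeric_partial(weights, x, d, 0, 1, 0)` gives a quick sanity check that the error terms were propagated from the last layer back to the first in the right order.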
At no point in this derivation have we specified the form of E(). In other words, the derivation is completely independent of the particular error criterion used. We will later discuss some of these alternative error measures.
Table 1 is a summary of the scalar and vector equations of the first preferred embodiment method.
We start from the equations in Table 1 to define an architecture that is independent of the error criterion.
Reviewing the equations in Table 1 we observe that much of the computations necessary are localized on a per layer basis. For a layer k we know that we are always given, by a previous calculation, the terms .sub.k-1 o.sub.j (n) and .sub.k .delta..sub.i (n). We similarly know the weight terms .sub.k w.sub.ij (n) and the semilinear function .sub.k .function..sub.i (). These terms are all local to the layer k.
TABLE 1. Summary of the scalar and vector equations.
__________________________________________________________________________
Scalar Equations                              Vector Equations
__________________________________________________________________________
Forward pass:                                 Forward pass:
Given input ${}_{-1}o_j(n)$.                  Given input vector ${}_{-1}o(n)$.
${}_k\mathrm{net}_i(n) = \sum_{j} {}_k w_{ij}(n)\,{}_{k-1}o_j(n)$ (3)        ${}_k\mathrm{net}(n) = {}_k W(n)\,{}_{k-1}o(n)$ (10)
${}_k o_i(n) = {}_k f_i({}_k\mathrm{net}_i(n))$ (2)        ${}_k o(n) = {}_k f({}_k\mathrm{net}(n))$ (9)
Backward pass:                                Backward pass:
Calculate ${}_{K-1}\delta_i(n)$.              Calculate ${}_{K-1}\delta(n)$.
${}_{k-1}\mathrm{delta}_j(n) = \sum_{i} {}_k w_{ij}(n)\,{}_k\delta_i(n)$ (27)        ${}_{k-1}\mathrm{delta}(n) = {}_k W^T(n)\,{}_k\delta(n)$ (28)
${}_{k-1}\delta_j(n) = {}_{k-1}f_j'({}_{k-1}\mathrm{net}_j(n))\,{}_{k-1}\mathrm{delta}_j(n)$ (31)        ${}_{k-1}\delta(n) = \mathrm{diag}({}_{k-1}f'({}_{k-1}\mathrm{net}(n)))\,{}_{k-1}\mathrm{delta}(n)$ (33)
Update step:                                  Update step:
$\partial E / \partial\,{}_k w_{ij}(n) = {}_k\delta_i(n)\,{}_{k-1}o_j(n)$ (19)        $\nabla_k E(n) = {}_k\delta(n)\,{}_{k-1}o^T(n)$ (20)
${}_k w_{ij}(n+1) = {}_k w_{ij}(n) - \eta\,\partial E / \partial\,{}_k w_{ij}(n)$ (11)        ${}_k W(n+1) = {}_k W(n) - \eta\,\nabla_k E(n)$ (13)
__________________________________________________________________________
From these local terms we can immediately calculate ${}_k\mathrm{net}_i(n)$, ${}_k o_i(n)$, ${}_{k-1}\mathrm{delta}_j(n)$, $\partial E / \partial\,{}_k w_{ij}(n)$, and the update of the weights ${}_k w_{ij}(n)$. But there is a catch that appears to prevent us from completing all of the calculations on a local basis. To calculate ${}_{k-1}\delta_j(n)$ we need to know ${}_{k-1}f_j'({}_{k-1}\mathrm{net}_j(n))$. This derivative is not local to the layer k since we have implied that ${}_{k-1}\mathrm{net}_j(n)$ is local to layer k-1, so we must find some way to calculate it locally. We now turn to this problem.
We will first look at a special form of ${}_k f_i(x)$. We will use the sigmoid function
$$ {}_k f_i(x) = \frac{1}{1 + e^{-x}} \tag{35} $$
Differentiating equation (35) yields:
$$ {}_k f_i'(x) = {}_k f_i(x)\,(1 - {}_k f_i(x)) \tag{36} $$
This form of ${}_k f_i'(x)$ is very useful to us. If we now look at the term that prevented us from localizing our computations for a layer, then:
$$ {}_{k-1}f_j'({}_{k-1}\mathrm{net}_j(n)) = {}_{k-1}f_j({}_{k-1}\mathrm{net}_j(n))\,(1 - {}_{k-1}f_j({}_{k-1}\mathrm{net}_j(n))) \tag{37} $$
$$ {}_{k-1}f_j'({}_{k-1}\mathrm{net}_j(n)) = {}_{k-1}o_j(n)\,(1 - {}_{k-1}o_j(n)) \tag{38} $$
Thus we see that we can indeed use ${}_{k-1}f_j'({}_{k-1}\mathrm{net}_j(n))$ on a local basis since it can be determined from ${}_{k-1}o_j(n)$ which is available to us on a local basis. Given this we can then subsequently calculate ${}_{k-1}\delta_j(n)$ as follows:
$$ {}_{k-1}\delta_j(n) = {}_{k-1}o_j(n)\,(1 - {}_{k-1}o_j(n))\,{}_{k-1}\mathrm{delta}_j(n) \tag{39} $$
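A brief numerical sketch of this local calculation follows. It assumes the logistic sigmoid and checks that the derivative obtained from the output alone agrees with a finite-difference derivative taken at the (non-local) net value; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime_from_output(o):
    """Derivative of the sigmoid expressed through its output only: o(1 - o)."""
    return o * (1.0 - o)

def local_delta(o_j, delta_sum_j):
    """Eq. (39): error term of a unit from its own output and the summed term."""
    return sigmoid_prime_from_output(o_j) * delta_sum_j

net = 0.3
o = sigmoid(net)
eps = 1e-6
numeric = (sigmoid(net + eps) - sigmoid(net - eps)) / (2.0 * eps)
local = sigmoid_prime_from_output(o)
```

The value `local` is computed without access to `net`, which is the point of the localization argument.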
We saw, in the previous paragraph, how, in a special case, we can express .sub.k-1 .function..sub.j '(.sub.k-1 net.sub.j (n)) as a function of .sub.k-1 o.sub.j (n). We can also do the same in less special circumstances. Let us require that .sub.k .function..sub.j (x) be not only semilinear but also strictly monotonically increasing, i.e., .sub.k .function..sub.j (x)>.sub.k .function..sub.j (y) whenever x>y. Then the inverse of .sub.k .function..sub.j (x) exists and
.sub.k-1 net.sub.j (n)=.sub.k-1 .function..sub.j.sup.-1 (.sub.k-1 o.sub.j (n)) (40)
It then immediately follows that
.sub.k-1 .function..sub.j '(.sub.k-1 net.sub.j (n))=.sub.k-1 .function..sub.j '(.sub.k-1 .function..sub.j.sup.-1 (.sub.k-1 o.sub.j (n)))(41)
So if we require that .sub.k .function..sub.j (x) be semilinear and strictly-monotonically increasing, then we may always calculate .sub.k-1 .function..sub.j '(.sub.k-1 net.sub.j (n)) locally from .sub.k-1 o.sub.j (n) if we have local information describing .sub.k-1 .function..sub.j ().
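The same idea can be sketched for a function other than the sigmoid. The example below uses the hyperbolic tangent, which is differentiable and strictly increasing; this choice and the helper name `derivative_from_output` are assumptions made for illustration only.

```python
import math

def derivative_from_output(f_prime, f_inverse, o):
    """Eq. (41): recover f'(net) from the output o via net = f^{-1}(o)."""
    return f_prime(f_inverse(o))

def tanh_prime(x):
    """Derivative of tanh: 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2

net = 0.7
o = math.tanh(net)  # forward-pass output of the unit
local = derivative_from_output(tanh_prime, math.atanh, o)
```

For tanh the composition simplifies to 1 - o², so, as with the sigmoid, only the stored output is needed in practice.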
Next we will exploit the local nature of the computations of the first preferred embodiment method with a first preferred embodiment structure for its implementation.
We will first focus on the purely linear portions of the forward pass and backward pass operations, specifically:
$$ {}_k\mathrm{net}_i(n) = \sum_{j=0}^{N_{k-1}-1} {}_k w_{ij}(n)\,{}_{k-1}o_j(n) \qquad\text{and}\qquad {}_{k-1}\mathrm{delta}_j(n) = \sum_{i=0}^{N_k-1} {}_k w_{ij}(n)\,{}_k\delta_i(n) $$
We will calculate each of these sums on a partial sum basis, that is
$$ {}_k s^{o}_{i(-1)}(n) = 0 \tag{42} $$
$$ {}_k s^{o}_{ij}(n) = {}_k s^{o}_{i(j-1)}(n) + {}_k w_{ij}(n)\,{}_{k-1}o_j(n) \tag{43} $$
$$ {}_k\mathrm{net}_i(n) = {}_k s^{o}_{i(N_{k-1}-1)}(n) \tag{44} $$
$$ j = 0, 1, \ldots, N_{k-1}-1 \tag{45} $$
and
$$ {}_k s^{\delta}_{(-1)j}(n) = 0 \tag{46} $$
$$ {}_k s^{\delta}_{ij}(n) = {}_k s^{\delta}_{(i-1)j}(n) + {}_k w_{ij}(n)\,{}_k\delta_i(n) \tag{47} $$
$$ {}_{k-1}\mathrm{delta}_j(n) = {}_k s^{\delta}_{(N_k-1)j}(n) \tag{48} $$
$$ i = 0, 1, \ldots, N_k-1 \tag{49} $$
For a particular i, j, and k we can calculate equations (43) and (47) using the preferred embodiment computational cell shown in FIG. 4. Note that the superscript (o or .delta.) on the s refers to the item being computed.
This cell has inputs .sub.k s.sub.i(j-1).sup.o (n), .sub.k-1 o.sub.j (n), .sub.k s.sub.(i-1)j.sup..delta. (n), and .sub.k .delta..sub.i (n). Its current state is defined by .sub.k w.sub.ij (n). It produces the outputs .sub.k s.sub.ij.sup.o (n) and .sub.k s.sub.ij.sup..delta. (n) according to (43) and (47). It updates its state according to equations (19) and (11): .sub.k w.sub.ij (n+1)=.sub.k w.sub.ij (n)-.eta..sub.k .delta..sub.i (n).sub.k-1 o.sub.j (n). Since the form of the computation for .sub.k s.sub.ij.sup.o (n), .sub.k s.sub.ij.sup..delta. (n), and .sub.k w.sub.ij (n+1) is the same, the same basic circuitry may be used to perform all three computations.
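The behavior of a single cell can be modeled in software as follows. This is a behavioral sketch only, not a description of the circuit of FIG. 4; the class name `Cell` and its method names are hypothetical, and the forward pass, backward pass, and update are modeled as separate calls.

```python
class Cell:
    """One array cell holding the weight w_ij of layer k (hypothetical model)."""

    def __init__(self, w, eta):
        self.w = w      # state: weight w_ij(n)
        self.eta = eta  # learning rate
        self.o_j = 0.0  # input o_j(n) remembered from the last forward pass

    def forward(self, s_in, o_j):
        """Eq. (43): add this cell's term to the row partial sum."""
        self.o_j = o_j
        return s_in + self.w * o_j

    def backward(self, s_in, delta_i):
        """Eq. (47): add this cell's term to the column partial sum."""
        return s_in + self.w * delta_i

    def update(self, delta_i):
        """Eqs. (19) and (11): w_ij(n+1) = w_ij(n) - eta * delta_i * o_j."""
        self.w = self.w - self.eta * delta_i * self.o_j

cell = Cell(w=2.0, eta=0.5)
s_o = cell.forward(1.0, 3.0)   # 1 + 2 * 3
s_d = cell.backward(0.0, 0.5)  # 0 + 2 * 0.5
cell.update(0.5)               # 2 - 0.5 * 0.5 * 3
```

All three operations are a multiply followed by an add or subtract, which is why the same basic circuitry can serve for each of them.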
Constructing a two-dimensional array of these cells we can calculate equations (44), (48), (2), and (31). The first preferred embodiment array to do this is shown in FIG. 5 and would correspond to layer k in FIGS. 2 and 6. The bottom border-row calculates equation (31) and the rightmost column calculates equation (2).
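The flow of partial sums through such an array can be sketched as below, assuming the weights of one layer are held as a nested list with one row per unit. The function names are hypothetical, and the border row and column that apply equations (31) and (2) are omitted.

```python
def array_forward(W, o_prev):
    """Row i passes a partial sum from cell to cell (eqs. 42-44), giving net_i."""
    nets = []
    for row in W:
        s = 0.0                 # eq. (42): initial partial sum
        for w_ij, o_j in zip(row, o_prev):
            s = s + w_ij * o_j  # eq. (43): one cell's contribution
        nets.append(s)          # eq. (44): value leaving the last cell in the row
    return nets

def array_backward(W, delta):
    """Column j passes a partial sum from row to row (eqs. 46-48)."""
    sums = [0.0] * len(W[0])    # eq. (46): initial partial sums, one per column
    for row, d_i in zip(W, delta):
        sums = [s + w_ij * d_i for s, w_ij in zip(sums, row)]  # eq. (47)
    return sums                 # eq. (48): values leaving the last row

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]           # N_k = 2 units, N_{k-1} = 3 inputs
nets = array_forward(W, [1.0, 0.0, -1.0])
back = array_backward(W, [1.0, 2.0])
```

Each cell only needs the running sum from its left or upper neighbor, its own weight, and the value broadcast along its column or row.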
The first preferred embodiment architecture shown in FIG. 5 has several important features:
1. All computations are done locally and all communication is to the nearest neighboring cells.
2. The regular structure makes it simple to expand the number of inputs and outputs for a layer composed of these simple cells. This is done by simply adding more rows and columns as required for the application.
3. The uniform, two-dimensional structure is ideal for implementations based upon existing analog and digital VLSI technology.
4. As wafer-scale integration becomes more of a practical technology, this structure is an ideal candidate for a wafer-scale implementation.
We can combine layers of these two dimensional first preferred embodiment arrays to form complete first preferred embodiment learning networks as shown in FIG. 6. Layer k in FIG. 6 is constructed of a two-dimensional array and has the associated vector inputs and outputs. Note that this figure also shows the use of an error calculation layer. Next we discuss what this layer does and how it might be implemented.
The error calculation layer shown in FIG. 6 calculates ${}_{K-1}\delta(n)$ given the output ${}_{K-1}o(n)$ and the target vector d(n). The target vector is defined as: ##EQU33##
The target vector is an input to the network which specifies the desired output of the network in response to an input. Recall that the error term for layer K-1 is given by equation (34):
$$ {}_{K-1}\delta_j(n) = {}_{K-1}f_j'({}_{K-1}\mathrm{net}_j(n))\,\frac{\partial E}{\partial\,{}_{K-1}o_j(n)} \tag{34} $$
And if we use the sigmoid function in equation (35) then ##EQU35##
A unit to implement this equation is shown in FIG. 7.
The most common form of error measure used is the square error: ##EQU36##
Another form of error is to take an exponentially-weighted average of past values of the square error: ##EQU37## In this case, the error can be calculated locally by locally averaging the same error term calculated for the square error case.
In both cases the error term can be calculated on a local basis with a simple functional unit.
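A possible software model of such an error calculation unit is sketched below. It assumes the logistic sigmoid and a square error of the form E = ½ Σ (o_j - d_j)², so that ∂E/∂o_j = o_j - d_j; the factor ½ and the function name `output_delta` are assumptions of this sketch rather than details taken from the text.

```python
def output_delta(o, d):
    """Last-layer error terms under the assumed square error and sigmoid:
    delta_j = (o_j - d_j) * o_j * (1 - o_j)."""
    return [(o_j - d_j) * o_j * (1.0 - o_j) for o_j, d_j in zip(o, d)]

o = [0.5, 0.25]   # outputs of the last layer
d = [1.0, 0.25]   # target vector
delta_last = output_delta(o, d)
```

Each entry uses only the output and target of its own unit, so the calculation is local, and an entry is zero whenever the output already matches the target.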
The first preferred embodiment computational cell shown in FIG. 4 includes memories for the current (time n) values of .sub.k w.sub.ij and .sub.k-1 o.sub.j which are updated on each backward and forward pass, respectively. The cell also includes multiply and accumulate circuits for the partial sums and for the updating. In more detail, FIG. 8a is a flow graph for the computation of equation (43), FIG. 8b is a flow graph for the computation of equation (47), and FIG. 8c is a flow graph for the computation of equations (11) and (19) updating the state variables (weights). Combining these flow graphs provides a realization of the computational cell of FIG. 4 as illustrated in FIG. 9. The time evolution of the cell is determined by the learning rate .eta. and the state variable .sub.k w.sub.ij (n).
FIG. 10 is a modified version of the cells of FIGS. 8a-b where use of multiplexers has reduced the number of multipliers from two to one and the number of adders from two to one. Where the leftmost inputs/outputs are selected in the multiplexers, equation (43) is computed; whereas selection of the rightmost inputs/outputs computes equation (47). Similarly, including further multiplexers can reduce the number of computational elements in the full cell of FIG. 9.
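The sharing of one multiplier and one adder can be modeled in software as follows. This is a behavioral sketch under the assumption that a single select signal drives all multiplexers; the function name `shared_mac` and its argument order are hypothetical.

```python
def shared_mac(select_forward, w, s_o_in, o_j, s_d_in, delta_i):
    """One multiply and one add, with operands chosen by a select signal."""
    x = o_j if select_forward else delta_i        # multiplexer on the multiplier input
    s_in = s_o_in if select_forward else s_d_in   # multiplexer on the adder input
    return s_in + w * x                           # eq. (43) or eq. (47)

fwd = shared_mac(True, 2.0, 1.0, 3.0, 10.0, 4.0)   # forward partial sum
bwd = shared_mac(False, 2.0, 1.0, 3.0, 10.0, 4.0)  # backward partial sum
```

The trade-off is that the forward and backward partial sums can no longer be produced at the same time by the same cell.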
The first preferred embodiment array of FIG. 5 is made of such cells plus the righthand column of processors that have .sub.k .function..sub.i stored in ROM (EEPROM) for computations of the outputs and the bottom row of processors that just add and multiply.
While the first preferred embodiment method is philosophically the same as the original GDR, it is algorithmically different in several fundamental ways; most notably for its localization of calculations and communication and its error-criteria independence.
These features were used to formulate the first preferred embodiment network architecture. In this architecture all computations can be performed on a local basis and using only nearest-neighbor communication. Furthermore, the first preferred embodiment computational cells support the construction of learning networks from simple two-dimensional arrays. The cells and arrays are particularly well suited for implementation in analog or digital VLSI technology.
Another important new result of the first preferred embodiment is that the architecture of the network is completely independent (except for the last layer) of the particular error criterion employed. Such a robust architecture will make it simpler to construct networks and to analyze the effect of different error criteria. This independence increases the importance of the cells used since they can be used in a wide variety of networks and not just a limited niche of learning networks.
MODIFICATIONS AND ADVANTAGES
Various modifications of the preferred embodiment devices and methods may be made while retaining the features of local layer computation by nearest neighbor simple cells.
Common processing for the forward and backward pass computations permits simple architecture, and various device types may be used for the processors such as general purpose processors, discrete digital processors, analog processors, and quantum effect devices.
The advantages of the present invention include the localized computation that permits large scale integration of the processors (units) and the common processing for the forward and backward pass computations for reduction of total amount of hardware logic required.
