The present disclosure relates generally to neural networks. More particularly, the present disclosure relates to leaky spiking neural networks that perform temporal encoding.
Traditionally, artificial neural networks have been predominantly constructed from idealized neurons that use non-linear activation layers to generate continuous activation values based on a set of weighted inputs. Some neural networks have multiple sequential layers of such neurons, in which case they may be referred to as “deep” neural networks.
Non-spiking neural networks typically pass information through the network using non-linear activation layers that produce continuous-valued outputs. These non-linear activation layers are differentiable, which enables the gradient of a loss function with respect to the weights of the network to be determined. In the multi-layer case, the existence of the gradient of the loss function makes it possible to use gradient-based optimization methods in combination with the backpropagation algorithm to learn particular weight values that enable the network to accurately perform a certain task.
Gradient-based optimization techniques (e.g., gradient descent) have been highly successful in training continuous-valued neural networks. However, gradient-based techniques do not easily transfer to spiking neural networks due to the hard nonlinearity of spike generation and the discrete nature of spike communication.
Furthermore, spiking neural networks are dynamic systems in which the respective times at which various neurons spike play a critical role. This is in contrast to conventional feedforward neural networks in which time is abstracted away. In particular, state transfer in classic neural nets happens globally and synchronously.
Synchronous systems must distribute a clock, and they forgo the use of phase as a means of extending the information transfer bandwidth between neurons. From the bandwidth viewpoint, the neurons would ideally self-synchronize. Self-synchronization would eliminate the clock distribution requirement and would increase the information transfer bandwidth in both hardware and software implementations of recurrent neural networks.
More particularly, unlike non-spiking neurons that output analog values, spiking neurons typically communicate using discrete spikes which are binary in nature (e.g., either a spike is output or not). Typically a spike triggers a trace of synaptic current in the receiving neuron or otherwise impacts a membrane potential of the receiving neuron. In some example formulations, the receiving neuron integrates received synaptic current over time until a firing threshold is reached, at which time the neuron itself spikes or fires. Due to their hard nonlinearity, neuron spike rates are typically non-differentiable, which has prevented widespread application of gradient-based techniques to spiking neural networks.
Thus, while backpropagation is an established general technique for training traditional non-spiking neural networks, a general technique for training spiking neural networks has not yet been established. Certain previous approaches that train spiking neural networks to produce particular spike patterns depend on the absence of any hidden layers (e.g., the input layer is directly connected to the output layer). Thus, multi-layer networks cannot be trained using these approaches.
It remains a challenge to train spiking networks, especially with multi-layer learning (e.g., deep spiking neural networks). Enabling learning within multi-layer spiking neural networks is an area of ongoing development and has potential to greatly improve the performance of spiking neural networks on different tasks.
Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
One example aspect of the present disclosure is directed to a computer system that includes one or more processors and one or more non-transitory computer readable media. The one or more non-transitory computer readable media collectively store a machine-learned spiking neural network that includes one or more spiking neurons that have an activation layer that uses a double exponential function to model a leaky input that an incoming neuron spike provides to a membrane potential of the spiking neuron. The one or more non-transitory computer readable media collectively store instructions that, when executed by the one or more processors, cause the computer system to perform operations. The operations include obtaining a network input. The operations include implementing the machine-learned spiking neural network to process the network input. The operations include receiving a network output generated by the machine-learned spiking neural network as a result of processing the network input.
In some implementations, the machine-learned spiking neural network encodes information in respective spike times associated with the one or more spiking neurons.
In some implementations, the double exponential function models the leaky input as a double exponential pulse. In some implementations, the double exponential function has the form $e^{-t}(t - 1 + c)$, where $c$ is a hyperparameter. In some implementations, the double exponential function has the form $te^{-t}$.
In some implementations, the membrane potential of each of the one or more spiking neurons, if it has not spiked, has the form $\sum_i w_i (t - t_i) e^{t_i - t}$.
In some implementations, implementing the machine-learned spiking neural network includes determining, for each of the one or more spiking neurons, a spike time that corresponds to an earliest time at which the membrane potential of the spiking neuron is equal to a firing threshold.
In some implementations, determining, for each of the one or more spiking neurons, the spike time includes applying a Lambert W function to determine the spike time.
In some implementations, the operations further include: prior to obtaining the network input, training the machine-learned spiking neural network on training data via a gradient descent technique. In some implementations, training the machine-learned spiking neural network via the gradient descent technique includes determining, for each of the one or more spiking neurons, one or both of: a derivative of a spike time of such spiking neuron with respect to the time points $t_i$; and a derivative of the spike time of such spiking neuron with respect to one or more of the weights $w_i$, wherein the spike time corresponds to an earliest time at which the membrane potential of such spiking neuron is equal to a firing threshold. In some implementations, training the machine-learned spiking neural network via the gradient descent technique includes modifying, for each of the one or more spiking neurons, at least one of the weights $w_i$ based at least in part on one or both of the derivative of the spike time of such spiking neuron with respect to the time points $t_i$ and the derivative of the spike time of such spiking neuron with respect to one or more of the weights $w_i$.
In some implementations, the machine-learned spiking neural network includes a plurality of layers, at least two of the plurality of layers including at least one of the one or more spiking neurons, and the machine-learned spiking neural network has been trained on training data using a backpropagation technique.
In some implementations, the operations further include: training the machine-learned spiking neural network on training data via a gradient descent technique to simultaneously learn both parameters of the machine-learned spiking neural network and a topology of the machine-learned spiking neural network.
Another example aspect of the present disclosure is directed to a computer-implemented method to train a spiking neural network that encodes information in respective spike times associated with a plurality of spiking neurons included in the spiking neural network. The method includes obtaining, by one or more computing devices, data descriptive of the spiking neural network that includes the plurality of spiking neurons. Each of the plurality of spiking neurons is respectively connected to one or more pre-synaptic neurons via one or more artificial synapses that have one or more weights associated therewith. Each of the plurality of spiking neurons has an activation layer that controls a respective spike time of such spiking neuron based on a membrane potential of such spiking neuron. The activation layer for each of the plurality of spiking neurons includes a double exponential that models incoming spikes received from the one or more presynaptic neurons as leaky inputs to the membrane potential. The method includes training, by the one or more computing devices, the spiking neural network based on a set of training data. Training, by the one or more computing devices, the spiking neural network includes: determining, by the one or more computing devices, a gradient of a loss function that evaluates a performance of the spiking neural network on the set of training data; and modifying, by the one or more computing devices for at least one of the plurality of spiking neurons, at least one of the one or more weights based at least in part on the gradient of the loss function.
In some implementations, each of the plurality of spiking neurons receives the incoming spikes from the one or more presynaptic neurons at respective inbound spike times. In some implementations, determining, by the one or more computing devices, the gradient of the loss function includes determining, by the one or more computing devices, for at least one of the plurality of spiking neurons, a derivative of the spike time of such spiking neuron with respect to the inbound spike times.
In some implementations, determining, by the one or more computing devices, the gradient of the loss function includes determining, by the one or more computing devices, for at least one of the plurality of spiking neurons, a derivative of the spike time of such spiking neuron with respect to one or more of the weights associated with such spiking neuron.
In some implementations, training, by the one or more computing devices, the spiking neural network further includes modifying, by the one or more computing devices for at least one of the plurality of spiking neurons, at least one synaptic delay parameter based at least in part on the gradient of the loss function.
In some implementations, the plurality of spiking neurons are arranged in a plurality of layers. In some implementations, training, by the one or more computing devices, the spiking neural network includes backpropagating, by the one or more computing devices, the loss function through the plurality of layers.
In some implementations, for each of the plurality of spiking neurons, the membrane potential, if such spiking neuron has not yet spiked, has the form $\sum_i w_i (t - t_i) e^{t_i - t}$.
Another example aspect of the present disclosure is directed to an electronic device. The electronic device includes a machine-learned spiking neural network that includes one or more spiking neurons. Each of the one or more spiking neurons has an activation layer that uses a double exponential function to model a leaky input that an incoming neuron spike provides to a membrane potential of the spiking neuron. The machine-learned spiking neural network is configured to receive a network input and to process the network input to generate a network output.
In some implementations, the machine-learned spiking neural network includes computer-readable instructions stored on a non-transitory computer-readable medium.
In some implementations, the machine-learned spiking neural network includes one or more electronic circuits that include electronic components arranged to execute the machine-learned spiking neural network using electrical current.
In some implementations, for each of the one or more spiking neurons, the corresponding electronic components that model the double exponential function include two capacitors, two resistors, and one or more transistors.
Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.
These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.
Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:
Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
Generally, the present disclosure is directed to spiking neural networks that perform temporal encoding for phase-coherent neural computing. In particular, according to an aspect of the present disclosure, a spiking neural network can include one or more spiking neurons that have an activation layer that uses a double exponential function, which can also be referred to as an “alpha function,” to model a leaky input that an incoming neuron spike provides to a membrane potential of the spiking neuron. The use of the double exponential function in the neuron's temporal transfer function creates a better defined maximum in time. This allows very clearly defined state transitions between “now” and the “future step” to happen without loss of phase coherence.
More particularly, the present disclosure provides biologically-realistic synaptic transfer functions, for example of the form $te^{-t}$, produced by the integration of exponentially decaying kernels. In contrast with the single exponential function, the double exponential function gradually rises before slowly decaying (see, e.g.,
Therefore, aspects of the present disclosure are directed to spiking network models that use the double exponential function for synaptic transfer and encode information in relative spike times. The networks can be fully trained in the temporal domain using exact gradients over domains where relative spiking order is preserved. In example experiments, models of this type have been shown to be capable of learning standard benchmark problems, such as Boolean logic gates and MNIST, encoded in individual spike times. To facilitate transformations of the class boundaries, synchronization pulses can be used, which are neurons that send spikes at input-independent, learned times.
The proposed models are easily able to solve temporally-encoded Boolean logic and other benchmark problems. An analysis of the behavior of the spiking network during training shows that it spontaneously displays two operational regimes that reflect a trade-off between speed and accuracy: a slow regime that is very accurate, and a fast regime that is slightly less accurate but makes decisions much faster.
Thus, the present disclosure develops the idea of temporal coding in leaky neurons (e.g., leaky integrate-and-fire neurons). One primary aspect described herein is the encoding of information in the spike times of spiking neurons, rather than spike rates. In particular, the output of a neuron can be its spike time, which can depend on the timings and weights of presynaptic neurons that cause it to fire. The formulation of a neuron's spike time in the continuous time domain renders it differentiable, which enables usage of backpropagation and gradient-based techniques to learn the spike timings in the network. This also optionally allows the addition of synaptic delays, also trainable using backpropagation techniques.
As such, according to another aspect, the present disclosure provides systems that enable application of gradient-based learning algorithms to learn the double exponential time transfer function. Furthermore, the systems described herein can implement the gradient-based learning algorithm to learn to build internal states in a recurrent network, allowing the network to learn states and state transfers faster.
The present disclosure provides a number of technical effects and benefits. As one example technical effect and benefit, by encoding information in spike times, the use of spike counts or spike rates can be eliminated. Further, as described herein, the neuron spike times can be formulated as a continuous representation which is differentiable and therefore amenable to gradient-based training techniques. Use of gradient-based techniques allows precise learning within the network (e.g., at the level of single spike times) and naturally extends to multi-layer scenarios, which would not be possible in training approaches based on rate-coding. In addition, use of gradient-based techniques for training the network can be more efficient than various other existing techniques which are more computationally expensive.
Enabling efficient training of spiking neural networks with gradient-based techniques provides further technical effects and benefits. Because the techniques described herein enable such training, spiking neural networks can be trained to perform many supervised and reinforcement learning tasks for which training spiking neural networks was previously impossible, or at least infeasible. In many instances, implementations of spiking neural networks on neuromorphic hardware can operate with significantly fewer energy resources than alternatives capable of performing these tasks (e.g., perceptron-based networks).
The trained spiking neural networks described above may be suited to perform a range of machine learning tasks. In particular, the inherently temporal nature of the trained spiking neural networks makes them particularly suited for machine learning tasks operating on temporal data, such as audio, video and/or sensor data. Examples of such machine learning tasks include speech recognition, event detection, and pattern recognition.
As another example technical effect and benefit, by encoding information in continuous space spike times, the network can be enabled to operate asynchronously. This better models the human brain and enables use of differential equations. Further, in some implementations, use of an asynchronous network can enable multiple rhythms or flows of information to propagate through the network at the same time, which can allow for parallel, sequential, and/or recurrent processing of input.
As another example technical effect and benefit, by encoding information in spike times, neuron firing can be highly sparse because the time of each spike can encode a large amount of information. As such, the networks described herein can be much more efficiently implemented than networks which encode information using spike rates, which themselves can be more efficient than traditional non-spiking networks. More particularly, since temporal encoding neurons typically fire many fewer times than rate-based encoding neurons, less computing resources (e.g., energy resources, processing resources, memory resources, etc.) are required to be expended to run the network. Thus, by encoding in spike time (e.g., high information content in spikes that are sparse in time) rather than spike rate, the number of neuron spikes (e.g., each of which can consume resources) can be greatly reduced.
Use of the double exponential function in the neuron's activation layer also provides technical effects and benefits. As one example, the double exponential function better mimics actual biological neuron behavior and provides a natural inherent rhythm/speed for information propagation within the network.
As another example, the double exponential function creates a better defined maximum in time (e.g., as opposed to a square wave representation, single exponential representation, or other monotonic representation). This allows very clearly defined state transitions between “now” and the “future step” to happen without loss of phase coherence.
In addition, summing or integration of incoming spikes can happen more effectively as the incoming spike's impact is moved from the exact immediate time of receipt to a slightly delayed point in the future. This slight delay enables more information to be collected prior to neuron spiking.
The use of a double exponential function also enables differentiation to occur with a double differential instead of a single differential. The optimization surface for the double differential is often smoother than that of the single differential, which will often exhibit ripples. This smoother optimization surface can result in faster training times and better convergence, as the gradient descent technique is able to more quickly and easily locate an optimal point on the surface. Faster training and better convergence can result in savings of various resources as less computing resources (e.g., energy resources, processing resources, memory resources, etc.) are required to be expended to train the network.
Although particular emphasis is placed on use of the double exponential function in the present disclosure, other functions could be used in addition or alternatively to the double exponential function. As examples, a Gaussian or a Poisson distribution could be used as or in a temporal activation layer. As other examples, other non-monotonic and/or unimodal functions can be used in addition or alternatively to the double exponential function. In general, aspects of the present disclosure can be applied to and/or use any function that is smooth, always positive, has a single maximum in the near future, and becomes zero in the far future.
In example implementations of the proposed models, information can be encoded in the relative timing of individual spikes. The input features can be encoded in temporal domain as the spike times of individual input neurons, with each neuron corresponding to a distinct feature. More salient information about a feature can be encoded as an earlier spike in the corresponding neuron. Information can propagate through the network in a temporal fashion. Each hidden and output neuron can spike when its membrane potential rises above a fixed threshold. Similarly to the input layer, the output layer of the network can encode a result in the relative timing of output spikes. In other words, the computational process can include producing a temporal sequence of spikes across the network in a particular order, with the result encoded in the ordering of spikes in the output layer.
This model can be used to solve standard classification problems. Given a classification problem with m inputs and n possible classes, the inputs can be encoded as the spike times of individual neurons in the input layer and the result can be encoded as the index of the neuron that spikes first among the neurons in the output layer. An example drawn from class k is classified correctly if and only if the kth output neuron is the first to spike. An earlier output spike can reflect more confidence of the network in classifying a particular example, as it implies more synaptic efficiency or a smaller number of presynaptic spikes. In a biological setting, the winning neuron could suppress the activity of neighbouring neurons through lateral inhibition, while in a machine learning setting the spike times of the non-winning neurons can be useful in indicating alternative predictions of the network. The learning process aims to change the synaptic weights and thus the spike timings in such a way that the target order of spikes is produced.
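As a minimal sketch of the decoding rule described above, the following Python snippet reads a class label from a list of output-layer spike times. The function names `classify` and `ranked_predictions`, the use of `math.inf` for a neuron that never spikes, and the sample values are illustrative assumptions rather than part of the disclosure.

```python
import math

def classify(output_spike_times):
    """Return the index of the output neuron that spikes first.

    Neurons that never spike are represented by math.inf (an assumption of
    this sketch). Returns None when no output neuron spikes at all. Ties are
    resolved in favour of the lowest index.
    """
    best = min(range(len(output_spike_times)), key=lambda k: output_spike_times[k])
    if math.isinf(output_spike_times[best]):
        return None
    return best

def ranked_predictions(output_spike_times):
    """Indices of the output neurons ordered from earliest to latest spike."""
    return sorted(range(len(output_spike_times)), key=lambda k: output_spike_times[k])

times = [2.7, 1.3, math.inf, 1.9]   # hypothetical output-layer spike times
print(classify(times))              # index of the earliest-spiking neuron
print(ranked_predictions(times))    # alternative predictions, earliest first
```

The ranking function reflects the observation that the spike times of the non-winning neurons can indicate alternative predictions of the network.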
Each synapse 18, 20, 22 can have an adjustable weight 24, 26, 28 (e.g., scalar weight) associated therewith. The weights 24, 26, 28 can be changed as a result of learning. As described above, techniques for performing this learning rule within the spiking neural network context have been one of the most challenging components for developing multi-layer spiking neural networks because the non-differentiability of spike trains has limited application of the backpropagation algorithm.
Referring again to
More particularly, the spiking neuron 10 can have a membrane potential 30. The membrane potential 30 can be a continuous-valued function of time. In particular, the activity (e.g., transmitted spikes) of the presynaptic neurons 12, 14, 16 can modulate or otherwise impact the membrane potential 30 of spiking neuron 10. The spiking neuron 10 can also have an activation layer 32, which controls the spiking of the neuron (e.g., a spike time of the neuron 10) based on the membrane potential 30.
As one example, the activation layer 32 can generate an action potential or spike when the membrane potential 30 crosses a firing threshold. Thus, in one example, implementing the spiking neuron 10 can include determining a spike time that corresponds to an earliest time at which the membrane potential 30 of the spiking neuron 10 is equal to a firing threshold.
When the spiking neuron 10 fires or spikes, a spike can be sent along one or more downstream synapses 34 to one or more downstream neurons. Alternatively, depending on the position of the neuron 10 in the model structure, the spike can be an output of the network. Although one downstream synapse 34 is shown, the spike output by the neuron 10 can be sent down any number of downstream synapses 34.
Although not explicitly shown in
According to an aspect of the present disclosure, the activation layer 32 of the spiking neuron 10 can use a double exponential function to model a leaky input that an incoming neuron spike (e.g., an incoming spike from one of the presynaptic neurons 12, 14, 16) provides to the membrane potential 30 of the spiking neuron 10. In particular, this is obtained by integrating over time the incoming exponential synaptic current kernels of the form $\varepsilon(t) = \tau^{-1} e^{-\tau t}$, where $\tau$ is the decay constant. The potential of the neuronal membrane in response to a single incoming spike is then of the form $u(t) = te^{-\tau t}$. This function has a gradual rise and a slow decay, peaking at $t_{max} = \tau^{-1}$. Every synaptic connection has an efficiency, or a weight. The decay rate has the effect of scaling the induced potential in amplitude and time, while the weight of the synapse has the effect of scaling the amplitude only.
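As a small numerical illustration of the single-spike response described above, the following Python sketch evaluates $u(t) = te^{-\tau t}$ on a grid and locates its maximum by brute force. The names `psp` and `grid_peak`, the grid resolution, and the chosen values of $\tau$ are assumptions made for this example only.

```python
import math

def psp(t, tau=1.0, weight=1.0):
    """Potential induced at time t by one spike received at time 0.

    Implements weight * t * exp(-tau * t) for t > 0 and 0 otherwise.
    """
    if t <= 0.0:
        return 0.0
    return weight * t * math.exp(-tau * t)

def grid_peak(tau, weight=1.0, t_end=10.0, steps=100_000):
    """Return (t, u) at the maximum of psp on a uniform grid over [0, t_end]."""
    best_t, best_u = 0.0, 0.0
    for k in range(steps + 1):
        t = t_end * k / steps
        u = psp(t, tau, weight)
        if u > best_u:
            best_t, best_u = t, u
    return best_t, best_u

for tau in (0.5, 1.0, 2.0):
    t_peak, u_peak = grid_peak(tau)
    print(f"tau={tau}: peak near t={t_peak:.3f} (1/tau={1 / tau:.3f}), height={u_peak:.4f}")
```

Changing `tau` moves the peak in time and changes its height, whereas changing `weight` only rescales the height, which matches the scaling behavior stated above.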
The use of the double exponential function in the neuron's activation layer 32 creates a better defined maximum in time. This allows very clearly defined state transitions between “now” and the “future step” to happen without loss of phase coherence.
More particularly, in some implementations, the double exponential function can model a leaky input as a double exponential pulse. A double exponential function can be any function that adheres to the following: $e^{-At} - e^{-Bt}$, with $A < B$, defined for positive time $t$. For example, in some implementations, the double exponential function can take the form $e^{-t}(t - 1 + c)$, where $c$ is a hyperparameter. In instances in which $c$ is set equal to 1, the double exponential function can take the form $te^{-t}$.
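As a brief worked note on the general form above (a standard calculus result, not taken from the disclosure), the pulse $e^{-At} - e^{-Bt}$ with $0 < A < B$ has a single maximum, and a rescaled version of it approaches the $te^{-At}$ shape as $B$ approaches $A$:

```latex
\frac{d}{dt}\left(e^{-At} - e^{-Bt}\right) = -A e^{-At} + B e^{-Bt} = 0
\;\Longrightarrow\;
t_{max} = \frac{\ln(B/A)}{B - A},
\qquad
\lim_{B \to A} \frac{e^{-At} - e^{-Bt}}{B - A} = t\,e^{-At},
\qquad
\lim_{B \to A} \frac{\ln(B/A)}{B - A} = \frac{1}{A}.
```

With $A = 1$, the limiting pulse is $te^{-t}$, whose maximum lies at $t = 1$.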
Referring again to
On the other hand, if a neuron has spiked, then there are several methods to “reset” it. One example is to restore the membrane potential to its default value and/or let the neuron be in a refractory period where it is unable to react to incoming stimuli.
Thus, the neuron 10 spikes when the membrane potential 30 crosses the firing threshold (see
This can be achieved by sorting the inputs and adding them to It
Eq. 2 has two potential solutions—one on the rising part of the function and one on the decaying part. If a solution exists (in other words, if the neuron spikes), then its spike time is the earlier of the two solutions.
For a set of inputs $I$, denote $A_I = \sum_{i \in I} w_i e^{\tau t_i}$
A spike will occur whenever the Lambert W function has a valid argument and the resulting $t_{out}$ is later than all input spike times. Because the earlier solution of this equation is the one sought, the main branch of the Lambert W function can be employed. The Lambert W function is real-valued when its argument is larger than or equal to $-e^{-1}$. It can be proven that this is always the case when Eq. 2 has a solution, by expanding $V_{mem}(t_{max}) \geq \theta$, where
is the peak of the membrane potential function corresponding to the presynaptic set of inputs I.
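The spike-time computation discussed above can be sketched in Python as follows. Because the closed-form expressions are not reproduced in the text at this point, the sketch derives them from the threshold condition $\sum_{i \in I} w_i (t - t_i) e^{\tau(t_i - t)} = \theta$. Writing $A_I = \sum_{i \in I} w_i e^{\tau t_i}$ and introducing the auxiliary sum $B_I = \sum_{i \in I} w_i t_i e^{\tau t_i}$ (a name assumed here), the condition rearranges to $t_{out} = B_I/A_I - W(-\tau\theta A_I^{-1} e^{\tau B_I/A_I})/\tau$. The hand-rolled `lambert_w0` routine, the function names, and the loop that grows the input set in sorted order until the spike precedes the next input are assumptions of this example, not a statement of the disclosed implementation.

```python
import math

_BRANCH_POINT = -math.exp(-1.0)  # W is real-valued only for arguments >= -1/e

def lambert_w0(x, tol=1e-12, max_iter=64):
    """Main branch of the Lambert W function via Halley iteration (sketch)."""
    if x < _BRANCH_POINT:
        raise ValueError("Lambert W0 is not real-valued for x < -1/e")
    if x < -0.25:
        # Series expansion around the branch point as a starting guess.
        p = math.sqrt(max(0.0, 2.0 * (math.e * x + 1.0)))
        w = -1.0 + p - p * p / 3.0
    else:
        w = math.log1p(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        f = w * ew - x
        denom = ew * (w + 1.0)
        if denom == 0.0:  # exactly at the branch point, where W0 = -1
            break
        denom -= (w + 2.0) * f / (2.0 * w + 2.0)
        step = f / denom
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def membrane_potential(t, times, weights, tau=1.0):
    """Sum of w_i (t - t_i) exp(tau (t_i - t)) over inputs already received."""
    return sum(w * (t - ti) * math.exp(tau * (ti - t))
               for ti, w in zip(times, weights) if ti <= t)

def spike_time(times, weights, theta=1.0, tau=1.0):
    """Earliest threshold crossing, or None if the neuron never spikes.

    Assumes theta > 0 and tau > 0. Inputs are added in order of arrival, and
    a candidate spike is accepted only if it falls before the next input.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    a = b = 0.0
    for pos, i in enumerate(order):
        e = math.exp(tau * times[i])
        a += weights[i] * e
        b += weights[i] * e * times[i]
        t_next = times[order[pos + 1]] if pos + 1 < len(order) else math.inf
        if a <= 0.0:
            continue  # potential cannot rise to a positive threshold
        # log of the magnitude of the Lambert W argument, to avoid overflow
        log_mag = math.log(tau * theta / a) + tau * b / a
        if log_mag > -1.0:
            continue  # argument below -1/e: peak stays under the threshold
        t_out = b / a - lambert_w0(-math.exp(log_mag)) / tau
        if times[i] <= t_out <= t_next:
            return t_out
    return None

t_out = spike_time([0.0, 0.5], [2.0, 2.0])
print(t_out, membrane_potential(t_out, [0.0, 0.5], [2.0, 2.0]))
```

Using the main branch selects the crossing on the rising part of the potential, and substituting the result back into `membrane_potential` provides a simple consistency check against the threshold.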
One example spiking neural network architecture according to the present disclosure can include one or more (e.g., many) spiking neurons and/or non-spiking neurons. Some or all of the spiking neurons can have the structure and function illustrated in and described with respect to
In some implementations, the neurons of the spiking neural network can be arranged in multiple sequential layers, including, for example, multiple sequential layers that each include spiking neurons (e.g., a “deep” spiking neural network). In one particular example, one or more layers that include spiking neurons can be followed by one or more layers that include non-spiking neurons.
The spiking network can be a feed-forward network, a recurrent network, a convolutional network, or combinations thereof. Connections between neurons in adjacent layers can be structured in an all-to-all configuration and/or in a sparse configuration.
In some implementations, the spiking neural network can encode information in the spike times of spikes that are output by spiking neurons of the network. Thus, the information output of a neuron can be encoded in its spike time, which depends on the timings and weights of presynaptic neurons that caused it to fire. This can enable the network to operate asynchronously. This better models the human brain and enables use of differential equations and backpropagation to adjust the spike timings in the network.
In some implementations, for example in a classification problem, the input class can be determined by which neuron in the output layer spikes first. In some implementations, each spiking neuron in the network is allowed to spike only once per cycle.
Further, in some implementations, use of an asynchronous network can enable multiple rhythms or flows of information (also known as “wavefronts”) to propagate through the network at the same time, which can allow for parallel, sequential, and/or recurrent processing of input. For example, multiple wavefronts can propagate through the network at different phases (e.g., different but coherent phases). Propagation of wavefronts in this manner does not rely on synchronized clocking. Instead, the wavefront is itself the clocking. In some implementations, explicit clocking policies can be imposed at or around interfaces for data input and/or output.
In one particular example, the spiking neural network can be toroidal in structure. In such implementations, wavefronts can be cyclically propagated around the toroidal network with or without additional input, output, and/or other modifications (e.g., sequential input can be input over time at each cycle).
In some implementations, the spiking neural networks can be implemented in the form of computer-readable instructions stored in a computer-readable medium which are accessed and executed by one or more processors. Alternatively or additionally, the spiking neural networks can be implemented in the form of one or more electronic circuits that include electronic components arranged to execute the machine-learned spiking neural network using electrical current. As an example, the corresponding electronic components that model the double exponential function can include two capacitors, two resistors, and one or more transistors.
As one example training technique, backpropagation techniques can be used in combination with gradient-based techniques to backpropagate a loss through multiple layers of a network. For example, the loss can be a supervised loss of a loss function that evaluates the performance of the network on a set of labeled training data. Thus, in some implementations, training the spiking neural network can include determining a gradient of a loss function that evaluates a performance of the spiking neural network on the set of training data; and modifying, for at least one of the plurality of spiking neurons, at least one of the one or more weights based at least in part on the gradient of the loss function.
As one example, the spiking network can learn to solve problems whose inputs and solution are encoded in the times of individual input and output spikes. Therefore, one possible goal is to adjust the output spike times so that their relative order is correct. Given a classification problem with n classes, the neuron corresponding to the correct label should be the earliest to spike. Therefore, one example loss function that can be used seeks to minimize the spike time of the target neuron and maximize the spike time(s) of the non-target neurons. Note that this is the opposite of the usual classification setting involving probabilities, where the value corresponding to the correct class is maximised and those corresponding to incorrect classes are minimised. As one example technique to achieve this effect, the softmax function can be used on the negative values of the spike times $o_i$ (which are always positive) in the output layer: $p_j = \frac{e^{-o_j}}{\sum_{i=1}^{n} e^{-o_i}}$.
Cross-entropy loss can be used in the usual form: $L(y_i, p_i) = -\sum_{i=1}^{n} y_i \ln p_i$, where $y_i$ is an element of the one-hot encoded target vector of output spike times. Taking the negative values of the spike times ensures that minimizing the cross-entropy loss minimizes the spike time of the correct label and maximizes the rest.
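As a minimal illustrative sketch of the loss described above, the following Python code applies a softmax to negated output spike times and evaluates the cross-entropy against a one-hot target. The function names and the example spike times are hypothetical and chosen only for illustration; they are not taken from any particular implementation.

```python
import math

def spike_time_softmax(spike_times):
    """Softmax over negated spike times, so earlier spikes get higher probability."""
    negated = [-t for t in spike_times]
    shift = max(negated)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in negated]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_target, probs):
    """Cross-entropy L = -sum_i y_i ln p_i for a one-hot target vector."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_target, probs) if y > 0)

# Hypothetical output layer with three classes; class 1 spikes first.
output_spike_times = [0.8, 0.3, 1.5]
target = [0, 1, 0]

probs = spike_time_softmax(output_spike_times)
loss = cross_entropy(target, probs)
print(probs, loss)
```

In this sketch, moving the target neuron's spike earlier, or the other neurons' spikes later, lowers the loss, which matches the stated goal of making the correct neuron the first to spike.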
In some implementations, determining the gradient of the loss function (e.g., the loss described above or other loss functions) can include determining, for at least one of the plurality of spiking neurons, a derivative of the spike time of such spiking neuron with respect to the weights associated with such spiking neuron.
As one example, to minimize the cross-entropy loss described above, a training system can change the value of the weights across the network. This has the effect of delaying or advancing spike times across the network. For any presynaptic spike arriving at time tj∈I with weight wj, denote
and compute the exact derivative of the postsynaptic spike time with respect to any presynaptic spike time tj and its weight wj as:
As the postsynaptic spike time moves earlier or later in time, when It
In some implementations, one or more synaptic delay parameters associated with the neuron can be trained using this gradient. As such, in some implementations, determining the gradient of the loss function can include determining, for at least one of the plurality of spiking neurons, a derivative of the spike time of such spiking neuron with respect to the weights associated with such spiking neuron and the inbound spike times associated with inbound spikes received by such neuron.
Additional example details regarding the derivation of the above gradient expressions are contained in U.S. Provisional Patent Application No. 62/744,150.
In some implementations, in order to adjust the class boundaries in the temporal domain, a temporal form of bias can be used to adjust spike times, i.e. to delay or advance them in time. In this model, synchronization pulses can act as additional inputs across some or all of the layers of the network, in order to provide temporal bias across the network. These can be thought of as similar to internally-generated rhythmic activity in biological networks, such as alpha waves in the visual cortex or theta and gamma waves in the hippocampus.
A set of pulses can be connected to all neurons in the network, to neurons within individual layers, or to individual neurons. A per-neuron bias is biologically implausible and more computationally demanding, hence some of the proposed models use either a single set of pulses per network, to solve easier problems, or a set of pulses per layer, to solve more difficult problems. All pulses can be fully connected to either all non-input neurons in the network or to all neurons of the non-input layer they are assigned to.
Each pulse can spike at a predefined and trainable time, providing a reference spike delay. Each set of pulses can be initialized to spike at times evenly distributed in the interval (0,1). Subsequently, the spike time of each pulse can be learned using Eq. 4, while the weights between pulses and neurons are trained using Eq. 5, in the same way as all other weights in the network.
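One possible reading of "evenly distributed in the interval (0,1)" is sketched below in Python. The helper name `init_pulse_times` is hypothetical, and the choice to exclude both endpoints of the interval is an assumption; other even spacings would satisfy the description equally well.

```python
def init_pulse_times(num_pulses):
    """Return num_pulses spike times evenly spaced strictly inside (0, 1).

    Assumption: the endpoints 0 and 1 are excluded, so the k-th pulse
    (k = 1..num_pulses) spikes at k / (num_pulses + 1).
    """
    return [(k + 1) / (num_pulses + 1) for k in range(num_pulses)]

# Hypothetical example: a set of four synchronization pulses.
pulse_times = init_pulse_times(4)
print(pulse_times)
```

These initial times would then serve as the starting point for training, after which each pulse time is free to move.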
Example experiments were conducted on fully connected feedforward networks with topology n_hidden (a vector of hidden layer sizes). Adam optimization was used with mini-batches of size batch_size to minimise the cross-entropy loss. The Adam optimizer performed better than stochastic gradient descent. Different learning rates were used for the pulse spike time (learning_rate_pulses) and the weights of both pulse and non-pulse neurons (learning_rate). A fixed firing threshold (fire_threshold) and decay constant (decay_constant) were used.
Network weight initialisation is crucial for the subsequent training of the network. In a spiking network, it is important that the initial weights are large enough to cause at least some of the neurons to spike; in the absence of spike events, there will be no gradient to use for learning. Therefore, in some implementations, a modified form of Glorot initialization can be used where the weights are drawn from a normal distribution with standard deviation $\sigma = \sqrt{2.0/(\mathrm{fan}_{in} + \mathrm{fan}_{out})}$ (as in the original scheme) and custom mean $\mu = \mathrm{multiplier} \times \sigma$. If the multiplication factor of the mean is 0, this is the same as the original Glorot initialization scheme. Different multiplication factors can be set for pulse (pulse_init_multiplier) and non-pulse (nonpulse_init_multiplier) weights. This allows the two types of neurons to pre-specialise into inhibitory and excitatory roles. In biological brains, internal oscillations are thought to be generated through inhibitory activities that regulate the excitatory effects of incoming stimuli.
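The modified initialization can be sketched in Python as follows. The function name, the layer sizes, and the multiplier values are hypothetical and serve only to illustrate the shifted mean; no particular values are implied by the description above.

```python
import math
import random

def modified_glorot_weights(fan_in, fan_out, multiplier, rng):
    """Draw a fan_in x fan_out weight matrix from N(mu, sigma^2).

    sigma = sqrt(2 / (fan_in + fan_out)) as in Glorot initialization,
    and mu = multiplier * sigma. A multiplier of 0 recovers the
    original zero-mean scheme.
    """
    sigma = math.sqrt(2.0 / (fan_in + fan_out))
    mu = multiplier * sigma
    return [[rng.gauss(mu, sigma) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)

# Illustrative multipliers only: a positive mean for non-pulse weights
# and a negative mean for pulse weights.
nonpulse_init_multiplier = 2.0
pulse_init_multiplier = -2.0

nonpulse_weights = modified_glorot_weights(100, 50, nonpulse_init_multiplier, rng)
pulse_weights = modified_glorot_weights(10, 50, pulse_init_multiplier, rng)
```

Using separate multipliers for the two weight groups shifts their means in different directions while leaving the spread of each distribution unchanged.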
Some example possible hyperparameters of the model are shown in the table below. The first column shows the default parameters chosen to solve Boolean logic problems. The second column shows the search range used in the hyperparameter search. Asterisks (*) mark ranges that were probed according to a logarithmic scale; all others were probed linearly. The last column shows the value chosen from these ranges to solve an example MNIST-based experiment.
Despite careful initialization, in some instances, the network might still become quiescent during training. This problem can be prevented by adding a fixed small penalty (penalty_no_spike) to the derivative of all presynaptic weights of a neuron that has not fired. In practice, after the training phase, some of the neurons will spike too late to matter in the classification and thus they do not need to spike at all.
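A minimal Python sketch of the no-spike penalty follows. The function name and the data layout are hypothetical. The sign convention is also an assumption: the penalty is subtracted from the gradient so that a gradient-descent step increases the presynaptic weights of a silent neuron and makes it more likely to fire. A neuron that has not fired is represented here by an infinite spike time.

```python
import math

def apply_no_spike_penalty(weight_grads, spike_times, penalty_no_spike):
    """Adjust weight gradients of neurons that did not fire.

    weight_grads[j] holds the loss derivatives with respect to the
    presynaptic weights of neuron j; spike_times[j] is math.inf when
    neuron j did not fire. Assumption: subtracting the penalty pushes
    the weights upward under gradient descent.
    """
    adjusted = []
    for grads_j, t_j in zip(weight_grads, spike_times):
        if math.isinf(t_j):
            adjusted.append([g - penalty_no_spike for g in grads_j])
        else:
            adjusted.append(list(grads_j))
    return adjusted

# Hypothetical layer of two neurons; neuron 0 stayed silent.
grads = [[0.0, 0.0], [0.2, -0.1]]
times = [math.inf, 0.7]
print(apply_no_spike_penalty(grads, times, penalty_no_spike=0.1))
```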
Another problem is that the gradients become very large when the input to a postsynaptic neuron is close to, but not quite sufficient for, reaching the firing threshold. In this case, in Eq. 4 and 5, the value of the Lambert W function will approach its minimum (−1) as its argument approaches $-e^{-1}$, the denominator of the derivatives will approach zero, and the derivatives will approach infinity. To counter this, the derivatives can be clipped to a fixed value clip_derivative. Note that this behavior will occur in any activation function that has a maximum (hence, a biologically-plausible shape), is differentiable, and has a continuous derivative.
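The effect of clipping can be illustrated with the short Python sketch below. It assumes that the denominator of the derivatives contains a factor (1 + W), as it does in the derivative of the Lambert W function itself, W'(z) = W(z) / (z (1 + W(z))). The function names and the clip value are hypothetical.

```python
def blowup_factor(w_value):
    """Factor 1 / (1 + W), which diverges as W approaches -1."""
    return 1.0 / (1.0 + w_value)

def clip_derivative_value(derivative, clip_derivative):
    """Clip a derivative to the range [-clip_derivative, clip_derivative]."""
    return max(-clip_derivative, min(clip_derivative, derivative))

clip_derivative = 100.0  # hypothetical clip value

# As W approaches its minimum of -1, the unclipped factor grows without bound.
for w_value in (-0.5, -0.9, -0.99, -0.999):
    raw = blowup_factor(w_value)
    clipped = clip_derivative_value(raw, clip_derivative)
    print(w_value, raw, clipped)
```

Clipping bounds the size of any single update while preserving its sign, so a near-threshold neuron cannot dominate a training step.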
In addition to these hyperparameters, several other heuristics for the spiking net can optionally be used. These include weight decay, adding random noise during training to the spike times of either the inputs or all non-output neurons in the network, averaging over brightness values in a convolutional-like manner, and adding additional input neurons responding to the inverted version of the image, akin to the on/off bipolar cells in the retina. Additionally, in some implementations, presynaptic neurons can be removed from the presynaptic set once their individual contribution to the potential has decayed below a decay threshold. This can be achieved by solving an equation similar to Eq. 2 for reaching a decay threshold on the decaying part of the function, using the −1 branch of the Lambert W function.
The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
The user computing device 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.
In some implementations, the user computing device 102 can store or include one or more spiking neural networks 120. For example, the spiking neural networks 120 can be or can otherwise include spiking neurons as described herein. Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks. Example spiking neural networks 120 are discussed with reference to
In some implementations, the one or more spiking neural networks 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 can implement multiple parallel instances of a single spiking neural network 120.
Additionally or alternatively, one or more spiking neural networks 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the spiking neural networks 140 can be implemented by the server computing system 130 as a portion of a web service. Thus, one or more networks 120 can be stored and implemented at the user computing device 102 and/or one or more networks 140 can be stored and implemented at the server computing system 130.
The user computing device 102 can also include one or more user input components 122 that receive user input. For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
As described above, the server computing system 130 can store or otherwise include one or more machine-learned spiking neural networks 140. For example, the networks 140 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Example networks 140 are discussed with reference to
The user computing device 102 and/or the server computing system 130 can train the networks 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180. The training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
The training computing system 150 includes one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices.
The training computing system 150 can include a model trainer 160 that trains the machine-learned networks 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
In particular, the model trainer 160 can train the spiking neural networks 120 and/or 140 based on a set of training data 162. In some implementations, the model trainer 160 can perform supervised learning techniques to train the networks based on the training data 162. The model trainer 160 can perform any of the techniques or operations described in the Example Training Techniques section above.
In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 102. Thus, in such implementations, the network 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.
The model trainer 160 includes computer logic utilized to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the model trainer 160 includes program files stored on a storage device, loaded into a memory, and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
The computing device 190 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
As illustrated in
The computing device 195 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
The central intelligence layer includes a number of machine-learned models. For example, as illustrated in
The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 195. As illustrated in
The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.
This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/744,150, filed Oct. 11, 2018. U.S. Provisional Patent Application No. 62/744,150 is hereby incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/055848 | 10/11/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62744150 | Oct 2018 | US |