The present disclosure relates to methods and devices for using neuromorphic hardware or spiking neural networks to perform linear algebraic calculations, such as matrix multiplication. More specifically, the present disclosure relates to methods and devices for tailoring neuromorphic or spiking neural network application specific integrated circuits (ASICs) designed to perform linear algebraic calculations, such as matrix multiplication. Thus, the present disclosure relates to constant depth, near constant depth, and sub-cubic size threshold circuits for linear algebraic calculations.
Despite the rapid advances in computer technologies and architectures over the last seventy years, still faster or more powerful forms of computing are desired. Neural computing technologies have been one of several proposed novel architectures to either replace or complement the ubiquitous von Neumann architecture platform that has dominated conventional computing for the last seventy years.
The illustrative embodiments provide for a method of increasing an efficiency at which a plurality of threshold gates arranged as neuromorphic hardware are able to perform a linear algebraic calculation having a dominant size of N. The computer-implemented method includes using the plurality of threshold gates to perform the linear algebraic calculation in a manner that is simultaneously efficient and at a near constant depth. “Efficient” is defined as a calculation algorithm that uses fewer of the plurality of threshold gates than a naïve algorithm. The naïve algorithm is a straightforward algorithm for solving the linear algebraic calculation. “Constant depth” is defined as an algorithm that has an execution time that is independent of a size of an input to the linear algebraic calculation. The near constant depth is a computing depth equal to or between O(log(log(N))) and the constant depth.
The illustrative embodiments also provide for a neuromorphic computer. The neuromorphic computer includes a plurality of threshold gates or spiking neurons configured to compute a specific linear algebraic calculation having a dominant size of N in a manner that is simultaneously efficient and at a near constant depth. “Efficient” is defined as a calculation algorithm that uses fewer of the plurality of threshold gates than a naïve algorithm. The naïve algorithm is a straightforward algorithm for solving the specific linear algebraic calculation. “Constant depth” is defined as an algorithm that has an execution time that is independent of a size of an input to the specific linear algebraic calculation. The near constant depth is a computing depth equal to or between O(log(log(N))) and the constant depth.
The illustrative embodiments also provide for a method of manufacturing a neuromorphic computer tailored to perform a specific linear algebraic calculation. The method includes manufacturing a plurality of threshold gates or spiking neurons. The method also includes arranging the plurality of threshold gates or spiking neurons to compute the linear algebraic calculation having a dominant size N in a manner that is simultaneously efficient and at a near constant depth. “Efficient” is defined as a calculation algorithm that uses fewer of the plurality of threshold gates than a naïve algorithm. The naïve algorithm is a straightforward algorithm for solving the linear algebraic calculation. “Constant depth” is defined as an algorithm that has an execution time that is independent of a size of an input to the linear algebraic calculation. Near constant depth is a computing depth equal to or between O(log(log(N))) and the constant depth.
The novel features believed characteristic of the illustrative embodiments are set forth in the appended claims. The illustrative embodiments, however, as well as a preferred mode of use, further objectives and features thereof, will best be understood by reference to the following detailed description of an illustrative embodiment of the present disclosure when read in conjunction with the accompanying drawings, wherein:
The illustrative embodiments recognize and take into account that, for decades, neural networks have shown promise for next-generation computing. Recent breakthroughs in machine learning techniques, such as deep neural networks, have provided state-of-the-art solutions for inference problems. However, these networks require thousands of training processes and are poorly suited for the precise computations required in scientific or similar arenas. Thus, for example, the illustrative embodiments recognize and take into account that neural networks and neuromorphic computers are poor at performing linear algebraic calculations, generally.
The illustrative embodiments also recognize and take into account that the advent of neuromorphic computing hardware has made hardware acceleration of many functions relevant to complex computing tasks possible. Thus, the illustrative embodiments describe a method to leverage neuromorphic hardware to perform non-trivial numerical computation, such as matrix multiplication or matrix inversion, at lower resource costs than a naïve implementation would provide.
As used herein, an efficient linear algebraic calculation algorithm is defined as being better than a naïve linear algebraic calculation algorithm with respect to a resource of interest. For example, a naïve implementation of matrix multiplication of two square matrices A and B (size N×N) requires O(N^3) multiplication operations. As a specific example, multiplying two matrices, each of size 3×3, would require 27 multiplication operations to solve. This naïve, or straightforward, algorithm is a direct multiplication of each row of matrix A with each column of matrix B.
In contrast, an efficient matrix multiplication algorithm uses fewer operations. Likewise, as used herein, an efficient linear algebraic calculation algorithm leverages a non-obvious technique, such as described by Strassen, to perform O(N^{2+δ}) operations, where δ is a number equal to or greater than zero, but less than one. These terms are distinguished from an efficient neuromorphic algorithm, below, though an efficient matrix multiplication algorithm or an efficient linear algebraic calculation algorithm should also be an efficient neuromorphic algorithm. Continuing with the topic of efficient matrix multiplication algorithms, an example of such an algorithm is the Strassen method. In the Strassen method, δ≈0.81. Subsequent improvements upon Strassen's approach have resulted in an efficient algorithm with δ≈0.37 for matrix multiplication, and it is a major open problem in computer science whether an algorithm with δ=0 exists. Although Strassen's multiplication algorithm was designed as a serial algorithm, it can be implemented as an iterative parallel algorithm with a depth (number of steps necessary to compute it) of O(log(N)). The total amount of work the parallel algorithm performs, as measured by the number of arithmetic operations, is O(N^{2+δ}).
As used herein, a constant depth algorithm has an execution time (that is, a number of steps necessary if unlimited parallel processors are available) that does not depend on the size of the input, which is N in the case above. In other words, an algorithm that has a depth of 6 will require 6 steps to compute, regardless of whether the input size N is equal to 10 or 1,000,000. Such algorithms exhibit perfect parallelism, in that the work the algorithm performs can be distributed across many parallel processors with only a constant sized overhead in combining the processing steps. Constant depth parallel algorithms are rare in most practical parallel computing models. The barrier to constant depth algorithms is that, even though work may be distributed across an unlimited number of processors, there is an inherent dependency in a sequence of computational steps; inputs of one step may depend on outputs of others. A perfectly parallel algorithm is able to select and organize its steps so that only a constant number are necessary.
As used herein, a “threshold gate” is a computational device comparable to a neuron on a neuromorphic computing platform. Threshold gates are typically Boolean functions (outputting a 1 or a 0). A Boolean threshold gate with m binary inputs y_1, y_2, y_3, . . . , y_m computes a linear threshold function, outputting 1 if and only if Σ_{i=1}^{m} w_i·y_i ≥ t, where the integer weights w_i and integer threshold t are constants associated with the gate. Threshold gates are different from conventional simple logic gates like AND and NOT in that they can have an unbounded number of inputs instead of 2 inputs, and they compare against a tunable threshold as opposed to checking that all inputs are the same or different.
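As an illustration of this definition, the following minimal Python sketch (illustrative only, not tied to any particular neuromorphic platform or vendor API) evaluates a single Boolean threshold gate.

```python
def threshold_gate(inputs, weights, threshold):
    """Boolean threshold gate: outputs 1 iff the weighted sum of the
    binary inputs meets or exceeds the integer threshold t."""
    assert len(inputs) == len(weights)
    weighted_sum = sum(w * y for w, y in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# A 3-input gate with weights (1, 1, 1) and threshold 3 fires only when
# all three inputs are 1; both the fan-in and the threshold are tunable.
print(threshold_gate([1, 1, 1], [1, 1, 1], 3))  # 1
print(threshold_gate([1, 0, 1], [1, 1, 1], 3))  # 0
```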
As used herein, the term “neuromorphic hardware” refers to hardware specialized to perform computation with threshold gates. Neuromorphic hardware may be viewed as a model of parallel computation, where the amount of work performed by an algorithm is the total number of gates required to implement it. An algorithm that is efficient in this sense is referred to as an “efficient neuromorphic algorithm.” Efficiency in this context refers to the number of gates employed. If a given computation in neuromorphic hardware can be performed by two different algorithms, the algorithm that uses the fewest threshold gates to perform the computation is considered the most efficient. The fewer the gates employed to perform a computation, the more efficient the algorithm. The depth of a neuromorphic algorithm is, as above, the number of steps required to obtain the output.
Attention is now turned to challenges in designing neuromorphic hardware. In particular, for neuromorphic hardware to be well suited to perform linear algebra, efficient implementations must be found that are also constant depth when expressed in a threshold gate formulation. The illustrative embodiments describe a method to achieve these characteristics simultaneously.
The illustrative embodiments also recognize and take into account another challenge in designing neuromorphic hardware. In particular, achieving either an efficient implementation or a constant depth implementation of matrix multiplication on neuromorphic hardware, taken separately, is not difficult; these are relatively straightforward or naïve exercises. For example, a naïve binary matrix multiplication algorithm can be formulated as a simple depth-2 threshold gate circuit. This example is provided below with respect to
However, achieving both efficiency and constant depth is a challenge for the following reasons. First, efficient methods that leverage iterative approaches, such as Strassen, must be converted into a constant depth network. This challenge was not previously recognized. For example, Strassen's iterative algorithm computes seven products for two 2×2 matrices, and then computes the output matrix by taking different linear combinations of these products. This process is then repeated for progressively larger blocks of the original data until the final product is obtained. That is, the entries of the next layer's two input matrices are the outputs of the first set of computations. This process requires O(log(N)) depth, and must be converted to a constant depth. It is difficult to do so while retaining the efficiency of the original approach.
The illustrative embodiments recognize another challenge to performing linear algebraic computations using neuromorphic hardware. Threshold gates perform weighted addition and compare the information to a threshold, however matrix multiplication algorithms require general arithmetic computations of sums and products. As described above, these efficient algorithms for tasks, such as matrix multiplication, require separate stages of computing involving additions and subtractions and involving computing products. These intermediate sums and products can vary in their precision depending on the location within the algorithm. While threshold gates do compute weighted sums in their internal computation, they are not general purpose devices. Therefore, special circuits must be constructed to perform addition and subtraction of two numbers and multiplication of two numbers. Accordingly, required precision for the computation amplifies through any circuit, requiring more sophisticated threshold gate structures to maintain efficiency.
Regardless of whether inputs utilize Boolean representations or not, intermediate stages in the computation have to be able to represent numbers with potentially amplifying precision. For example, the sum of K 1-bit numbers requires O(log(K)) precision to represent. This principle is particularly true for the case of complex sums and products being used in the efficient algorithms considered here. Thus, multiple threshold gates are required to communicate non-binary numbers of growing precision to subsequent stages of the algorithm. Performing this function while maintaining overall efficiency of the algorithm is non-trivial.
The illustrative embodiments successfully address these challenges. Section 3.3, below, works out the math for converting efficient methods that leverage iterative approaches into a constant depth network, for the particular case of matrix-matrix multiplication of two N×N matrices with the precision of O(log(N)) bits for each entry. This example illustrates that our method can achieve a constant depth with only a small penalty on efficiency, see Theorem 8 below, or get arbitrarily close to a constant depth with a comparable efficiency to state-of-the-art non-neural formulations, see Theorem 7, below.
In addition, section 3.2, below, describes the basic arithmetic blocks used as pieces of an overall inventive method. Some of these pieces were optimized for the purposes of the illustrative embodiments, such as Lemma 2, and Corollary 3, below. This technique is non-trivial.
Still further, the discussion in section 3.2 describes considerations around precision and maintaining the constant depth and resource requirements, particularly when dealing with negative numbers. Thus, the illustrative embodiments provide for constant depth, near constant depth, and sub-cubic size threshold circuits for linear algebraic calculations.
Matrix A 100, matrix B 102, and matrix C 104 are all two by two matrices. Thus, each of these matrices can be described as being of “order 2”. Similarly, each of these matrices can be described as an N×N matrix, where N=2.
Multiplying matrix A 100 and matrix B 102 can be done by a straight forward method, also known as a naïve method, a brute force method, or a standard method. The standard method is shown in column A 106 of
Thus, the value of C11 requires performing two multiplications. Calculating the value of the other three cells of matrix C 104 requires six more multiplications. Accordingly, the standard algorithm for finding the solution to multiplying two matrices, each two by two, requires eight multiplications.
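For reference, a minimal Python sketch of this standard method (illustrative only); the triple loop makes the count of scalar multiplications explicit.

```python
def naive_matmul(A, B):
    """Multiply two N x N matrices with the straightforward algorithm,
    performing exactly N**3 scalar multiplications."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]  # one multiplication per (i, j, k)
    return C

# For 2 x 2 inputs this performs 2**3 = 8 multiplications, as described above.
```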
More generally, multiplying two matrices of size N×N requires N^3 multiplication operations using the straightforward method. For the above example in
Such techniques exist. For example, the Strassen method of matrix multiplication is shown in column B 108. In the Strassen method, seven block multiplications are performed instead of eight. Thus, for a two by two matrix, one fewer multiplication step is required to solve for matrix C 104. More detail regarding the Strassen method, and generalizations of efficient methods of matrix multiplication, are given below.
As indicated above,
The r-ary tree TA for Strassen's Algorithm (r=7, T=2). For K×K matrices U and V, the notation U_{ij} or (U)_{ij} refers to the (i, j)th K/T×K/T block of U; observe that (U+V)_{ij}=U_{ij}+V_{ij}. Each node has r children corresponding to the r multiplication expressions M_i. An edge associated with M_i is labeled with the number of terms of A that appear in M_i. Each node u on level h corresponds to a matrix that is a weighted sum of N/T^h×N/T^h blocks of A. The number of blocks of A appearing in such a sum is the product of the edge labels on the path from u to the root of the tree. For example, (A_{12}−A_{22})_{12}−(A_{12}−A_{22})_{22} = (A_{12})_{12}−(A_{22})_{12}−(A_{12})_{22}+(A_{22})_{22} is a weighted sum of 4 N/T^2×N/T^2 blocks of A. The N^{log_T r} leaves of TA correspond to scalars that are weighted sums of entries of A.
As shown in
Attention is now turned to a more scientific and mathematical approach to the issues and solutions described above. As a broad overview, we begin with the observation that Boolean circuits of McCulloch-Pitts nodes are a classic model of neural computation studied heavily in the late 20th century as a model of general computation. Recent advances in large-scale neural computing hardware have made their practical implementation a near-term possibility.
The illustrative embodiments describe a theoretical approach for performing matrix multiplication using constant depth threshold gate circuits that integrates threshold gate logic with conventional fast matrix multiplication approaches that perform O(N^{3−ε}) arithmetic operations for constant ε>0. Dense matrix multiplication is a core operation in convolutional neural network training. Performing this work on a neural architecture instead of off-loading it to a general processing unit may be an appealing option. Prior to the illustrative embodiments, it was not known whether the Θ(N^3)-gate barrier was surmountable.
The illustrative embodiments describe the computational power of Boolean circuits where the fundamental gates have unbounded fan-in and compute a linear threshold function. Such circuits are rooted in the classical McCulloch-Pitts neuronal model, with linear threshold functions serving as plausible models of spiking neurons. A Boolean threshold gate with m binary inputs y_1, y_2, . . . , y_m computes a linear threshold function, outputting 1 if and only if Σ_{i=1}^{m} w_i·y_i ≥ t, where the integer weights w_i and integer threshold t are constants associated with the gate. Rational w_i and t may be handled, for example, by multiplying through by a common denominator. There are several natural measures of complexity associated with Boolean circuits, including size: the total number of gates; depth: the length of the longest directed path from an input node to an output node; edges: the total number of connections between gates; and fan-in: the maximum number of inputs to any gate.
Consider threshold circuits with a constant depth and polynomial size with respect to the total number of inputs. This class of circuits is called TC0. Such circuits represent a plausible model of constant-time parallel computing. This is a notion of perfect parallelizability, faster than the polylogarithmic time allowed in the complexity class NC. TC0 circuits can compute a variety of functions including integer arithmetic, sorting, and matrix multiplication. In contrast, constant depth and polynomial size circuits built from unbounded fan-in AND, OR, and NOT gates cannot compute functions such as the parity of n bits, which can be computed by a TC0 circuit of sublinear size. Understanding the power and limitations of TC0 circuits has been a major research challenge over the last couple of decades. The 1990s saw a flurry of results showing what TC0 circuits could do, while more recent results have focused on lower bounds showing what TC0 circuits cannot do.
TC0 has been studied as a theoretical model. Its practicality is an open question. Currently, large-scale electronic circuits with high fan-in may be difficult to implement, however neural-inspired architectures may offer hope. The adult human brain contains about 100 billion neurons, with maximum fan-in of about 10,000 in the cortex and larger in the cerebellum. Though impressive, this figure represents a single class of instance size, so one might wonder how a synthetic system based on the physical limitations governing a brain might scale asymptotically. We are not aware of any generative brain models for which this has been analyzed, however a fan-in that grows with the total system size seems plausible for a 3D circuit, such as the brain. Even if allowing unbounded fan-in, neuron resource requirements may grow as a function of fan-in. For example, there may be growth in energy or time due to a loss in precision. Constant depth, in the TC0 sense, may not equate to constant time, however such ideal algorithms may still guide the development of resource-efficient practical algorithms as neuromorphic architectures become more prevalent.
There is a renewed interest in the complexity of threshold circuits, in part because of recent developments in neural-inspired architectures. While neuromorphic computing has long had a focus on the use of analog computation to emulate neuronal dynamics, recent years have seen rapid development of novel digital complementary metal-oxide semiconductor (CMOS) neural hardware platforms which can scale to very large numbers of neurons. While the inspiration of these architectures in large part came from the desire to provide a substrate for large biologically inspired circuits, they are attracting attention as an alternative to conventional complementary metal-oxide semiconductor (CMOS) architectures for accelerating machine learning algorithms, such as deep artificial neural networks. Many of these neural architectures, such as TrueNorth and the SpiNNaker platform, achieve considerable benefits in energy and speed by using large numbers of simple digital spiking neurons instead of a relatively smaller number of powerful multi-purpose processors. These systems are almost configurable threshold gate circuits, except that they are capable of extended temporal dynamics. Scientific computing is an application domain for which neural architectures are often quickly dismissed. There is a perception that human cognition is better for data-centric functions, such as image recognition, and for abstract decision making than for precise numerical calculations, particularly at a large scale. While biologically-inspired neural algorithms are often probabilistic or approximate, the neuronal-level computations in large scale neural architectures are sufficiently precise for numerical computation.
Consider a fundamental scientific computing-inspired problem: can one produce constant depth threshold circuits that compute the product of two N×N matrices using O(N^{3−ε}) gates for constant ε>0? For matrices with relatively large entries (say Ω(N) bits), this goal seems out of reach. However, prior to this work, it was not known if this was possible even for binary matrices, those with entries that are all either 0 or 1.
The present disclosure shows how to multiply two N×N matrices with O(log(N))-bit entries using O(N^{3−ε}) gates and constant depth. The results are based on classical breakthroughs for fast matrix multiplication: multiplying two N×N matrices using O(N^{3−ε}) arithmetic operations. The naïve algorithm based on the definition of matrix multiplication requires Θ(N^3) arithmetic operations. These techniques result in O(log(N))-time conventional parallel algorithms (for architectures such as the parallel random-access machine (PRAM)) with O(N^{3−ε}) total work. In contrast, the embodiments present a constant-time algorithm, in the threshold circuit model, with O(N^{3−ε}) total gates, which is a reasonable measure of total work. This procedure is the first use of fast matrix multiplication techniques on an unconventional architecture.
One motivation for neural-circuit-based matrix multiplication is convolutional neural networks for deep learning. See Warden's clear explanation of the role of matrix multiplication in convolution steps for neural networks, which is summarized here. For more details see the Stanford course notes at http://cs231n.github.io. These networks assume the input is a two-dimensional image, with an n×n grid of pixels, each with l channels. The neural networks usually refer to the number of channels as depth, but here, “depth” refers to the number of layers in our circuit. Typically, the number of channels l is a constant, but not necessarily just the three classic color channels (red, green, blue). A convolutional step applies a set of K kernels to the image. Each kernel looks for a particular sub-pattern, such as a horizontal edge or a splash of red. The kernel considers a small constant q×q submatrix of pixels (with l channels) at a time and is applied across the whole image based on a stride. This recognizes the pattern no matter where it is in the image. For example, if the stride is four, then the kernel is applied to every fourth column and every fourth row. A place where the kernel is applied is called a patch. For each patch, for each kernel, a dot product scores the extent to which the patch matches the kernel. Computing all of the kernels simultaneously is a matrix multiplication. The first matrix is P×Q, where P=O(n2) is the number of patches and Q=q×q×l is the number of elements in a kernel. The second matrix is Q×K. This gives a P×K output matrix, giving the score for each patch for each kernel.
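The following sketch (hypothetical helper names; a plain "im2col"-style construction rather than the API of any particular deep learning framework) shows how the patch-versus-kernel dot products described above reduce to a single P×Q by Q×K matrix multiplication.

```python
import numpy as np

def conv_as_matmul(image, kernels, stride):
    """image: (n, n, l) array; kernels: (K, q, q, l) array.
    Builds the P x Q patch matrix and the Q x K kernel matrix, then a single
    matrix multiplication scores every patch against every kernel."""
    n, _, l = image.shape
    K, q, _, _ = kernels.shape
    patches = []
    for r in range(0, n - q + 1, stride):
        for c in range(0, n - q + 1, stride):
            # One patch flattened to Q = q * q * l values.
            patches.append(image[r:r + q, c:c + q, :].reshape(-1))
    patch_matrix = np.array(patches)          # P x Q
    kernel_matrix = kernels.reshape(K, -1).T  # Q x K
    return patch_matrix @ kernel_matrix       # P x K scores
```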
Let N be the largest matrix dimension, and suppose a fast matrix multiplication algorithm is used that can multiply two N×N matrices in time O(N^ω). The circuit requires fan-in as large as O(N^ω). These are gates at the end that compute the final output matrix entries. Two of the relevant matrix dimensions for convolutional neural networks, K and Q, are generally constants. The third dimension P is not. However, if the particular architecture can only support fan-in x, the matrix multiplication breaks into independent pieces, each with at most
rows in the first matrix. These can run in parallel, so they have the same depth, given a large enough architecture. Thus, the unbounded fan-in in the algorithm is not necessarily a practical limitation for the motivating application.
Deep learning is a major motivation for neural-inspired architectures. A current vision requires the matrix multiplication to be moved off-system to a general processor unit. If circuit-based matrix multiplication can be made practical, perhaps this computation can be left on-chip.
Although our results extend to multiplying two N×N matrices with O(log(N))-bit entries, for ease of explanation, the illustrative embodiments give details for binary matrix multiplication. This case illustrates the fundamental ideas of the approach. Matrix multiplication of binary matrices also has applications in social network analysis.
Social networks of current interest are too large for our circuit methods to be practical for neuromorphic architectures in the near future. Also, social network adjacency matrices are sparse, unlike the dense small matrices for convolutional neural networks described above. Nevertheless, the illustrative embodiments briefly review the motivation for matrix multiplication in this setting. One application is computing the clustering coefficient of an N-node graph (or subgraph). The global clustering coefficient is the ratio of the number of triangles in the graph to the number of wedges (length-2 paths) in the graph. A degree-δ node is at the center of δ(δ−1)/2 wedges. The global clustering coefficient is the fraction of total wedges in the graph that close into triangles. These triangles are common in social networks, where the central node of a wedge may introduce two neighbors.
Social network analysis researchers believe a high global clustering coefficient (also called transitivity) means the graph has a community structure. For example, Seshadhri, Kolda, and Pinar assumed constant global clustering coefficients when proving a structural property of social networks they used for their block two-level Erdős–Rényi (BTER) generative model. Orman, Labatut, and Cherifi empirically studied the relationship between the community structure and the clustering coefficient. They found that high clustering coefficients did imply community structure, although low clustering coefficients did not preclude it.
To simplify our presentation, the illustrative embodiments focus on the question: Does a graph G have at least τ triangles? The user can pick a value of τ that represents a reasonable community structure for their particular kind of graph. Usually they will compute the total number of wedges D in O(N) time and set τ to some function of D (perhaps just scaling by a constant).
There is a simple depth-2 threshold circuit to solve this problem for a graph G=(V,E). The circuit has an input variable x_{ij} for i,j ∈ V with i<j; the variable x_{ij} is 1 if ij ∈ E and 0 otherwise. The first layer of the circuit consists of a gate, g_{ijk}, for each triple i,j,k ∈ V with i<j<k. The gate g_{ijk} computes the value of the linear threshold function x_{ij}+x_{ik}+x_{jk} ≥ 3 as an output y_{ijk}. That is, the gate fires (y_{ijk}=1) if and only if all edges in the triangle on i, j, and k are in the graph. The second layer includes a single output gate that computes the linear threshold function Σ_{i,j,k∈V: i<j<k} y_{ijk} ≥ τ; this gate fires if and only if the number of triangles in G is at least τ. The circuit has (N choose 3) + 1 = O(N^3) gates.
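A brute-force simulation of this depth-2 circuit (plain Python, illustrative only) makes the construction concrete: one first-layer gate per triple of nodes plus a single output gate.

```python
from itertools import combinations

def has_at_least_tau_triangles(edges, num_nodes, tau):
    """Simulate the depth-2 threshold circuit described above.
    Layer 1: gate g_ijk fires iff x_ij + x_ik + x_jk >= 3.
    Layer 2: the output gate fires iff the number of firing g_ijk is >= tau."""
    present = {frozenset(e) for e in edges}
    def x(i, j):
        return 1 if frozenset((i, j)) in present else 0
    layer1 = [1 if x(i, j) + x(i, k) + x(j, k) >= 3 else 0
              for i, j, k in combinations(range(num_nodes), 3)]
    return 1 if sum(layer1) >= tau else 0

# Triangle on nodes 0, 1, 2 plus the pendant edge (2, 3): exactly one triangle.
print(has_at_least_tau_triangles([(0, 1), (0, 2), (1, 2), (2, 3)], 4, 1))  # 1
```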
The illustrative embodiments ask (and answer) whether it is possible to beat the size of this threshold circuit in constant depth. This is akin to asking if it is possible to beat the naïve matrix multiplication algorithm with an algorithm that performs O(N^{3−ε}) operations. In fact, the above threshold circuit is a specialization of the naïve matrix multiplication algorithm.
The analysis of the new threshold circuits is more involved than analyzing conventional fast matrix multiplication methods. The illustrative embodiments must explicitly consider sparsity (see Definition 1), a measure of how many times a matrix element or intermediate result is part of a computation during the fast multiplication. Thus, while the illustrative embodiments use existing fast matrix multiplication techniques to achieve the results, they are used in a new context. The performance exploits different features of fast matrix multiplication techniques than have been previously used.
Attention is now turned to the results of the analysis and studies. Consider a fast recursive or divide-and-conquer matrix multiplication algorithm like Strassen's with a run-time complexity of O(N^ω). The illustrative embodiments will consistently use ω as the exponent in the runtime complexity of the base non-circuit fast matrix multiplication algorithm.
The main result is an O(d)-depth threshold circuit that uses Õ(d·N^{ω+c·γ^d}) gates to determine whether a graph contains at least τ triangles, where d is a positive integer parameter, c is a constant, and 0<γ<1 is a constant determined by the fast matrix multiplication algorithm (for Strassen's algorithm, γ≈0.491 and c≈1.585); see Theorem 8, below.
Attention is returned to
Section 2: Fast Matrix Multiplication.
Section 2.1: Strassen's Matrix Multiplication Algorithm.
Strassen developed the first matrix multiplication algorithm requiring O(N^{3−ε}) multiplications. Strassen observed that one can compute the matrix product C=AB for two by two matrices A and B using seven multiplications rather than the eight multiplications required by the naïve algorithm. The reduction in multiplications comes at the expense of additional additions and subtractions.
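A minimal Python sketch of Strassen's seven-product scheme for 2×2 matrices (scalar entries shown for simplicity; the recursive algorithm applies the same scheme to N/2×N/2 blocks).

```python
def strassen_2x2(a, b):
    """One level of Strassen's scheme: 7 multiplications instead of 8."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))

print(strassen_2x2(((1, 2), (3, 4)), ((5, 6), (7, 8))))  # ((19, 22), (43, 50))
```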
The divide-and-conquer nature of Strassen's algorithm lends itself to a natural O(log N)-time parallel (PRAM) implementation with a total work of O(N^{log_2 7}) ≈ O(N^{2.81}) arithmetic operations.
Although Strassen's seminal approach was based on a fast matrix multiplication algorithm for two by two matrices, subsequent work has yielded improved algorithms by employing fast matrix multiplication algorithms involving larger square matrices, as well as more sophisticated techniques. The currently best known algorithm requires O(N^{2.373}) operations. See the survey by Bläser for a detailed introduction to and history of this area.
Section 3: Threshold Circuits for Counting Triangles in Graphs.
Section 3.1: Problem Statement.
Let A be the N×N symmetric adjacency matrix of a graph G=(V, E) with N nodes: for i,j ∈ V, A_{ij}=A_{ji}=1 if ij ∈ E, and A_{ij}=A_{ji}=0 otherwise. Consider the square of the adjacency matrix, C=A^2. For i,j ∈ V with i≠j, C_{ij} = Σ_{k∈V} A_{ik}A_{kj} = |{k ∈ V | ik ∈ E and jk ∈ E}|, which is the number of paths of length 2 between i and j. If there is an edge between the nodes i and j, then each path of length 2 between them, along with the edge ij, forms a triangle in G. Moreover, every triangle containing i and j arises in this way.
Suppose G has Δ triangles. Then, 3Δ = Σ_{i,j∈V: i<j} A_{ij}C_{ij} (1), since the sum counts each triangle once for each of its edges. Thus, one can count the triangles in G by summing some of the entries of A^2. An equivalent computation is the trace of A^3, trace(A^3), which (from (1)) is equal to 6Δ.
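A short check of this identity (plain Python, illustrative only): summing A_ij·(A^2)_ij over i<j counts each triangle once per edge, so dividing by three recovers Δ.

```python
def count_triangles(A):
    """Count triangles from an adjacency matrix via 3*Delta = sum_{i<j} A_ij * (A^2)_ij."""
    n = len(A)
    C = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]                     # C = A^2: counts of length-2 paths
    three_delta = sum(A[i][j] * C[i][j]
                      for i in range(n) for j in range(i + 1, n))
    return three_delta // 3

# Triangle on nodes 0, 1, 2 plus pendant edge 2-3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(count_triangles(A))  # 1
```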
The illustrative embodiments employ a threshold circuit implementation of fast matrix multiplication algorithms to compute the sum in constant depth using O(N^{3−ε}) gates. In fact, the exponent of the gate count can be made arbitrarily close to the exponent of the arithmetic operation count for the best possible fast matrix multiplication algorithm.
The illustrative embodiments explain the notion of a fast matrix multiplication algorithm. The illustrative embodiments assume that an algorithm is given for multiplying two T×T matrices using a total of r multiplications (for Strassen's algorithm, T=2 and r=7). The illustrative embodiments assume N=T^l for some positive integer l. As outlined in Section 2.1, this yields a recursive algorithm for computing the product of two N×N matrices, C=AB, using a total of r^l = r^{log_T N} = N^{log_T r} multiplications.
As with Strassen's algorithm, the illustrative embodiments assume that a list of the r expressions for the multiplications, M_1, . . . , M_r, is given; the illustrative embodiments view each M_i as an expression involving the T^2 different N/T×N/T blocks of A and B. In particular, each M_i is a product of a {−1,1}-weighted sum of blocks of A with a {−1,1}-weighted sum of blocks of B. We also assume the fast matrix multiplication algorithm provides a list of T^2 expressions, each representing an N/T×N/T block of C as a {−1,1}-weighted sum of the M_i. Although the illustrative embodiments do not present details here, the techniques can be extended for fast matrix multiplication algorithms in which more general rational weights are employed.
For 1≤i≤r, let a_i be the number of distinct blocks of A that appear in the expression M_i, and let b_i be defined analogously with respect to B. The illustrative embodiments let c_i be the number of expressions for blocks of C in which M_i appears. Definition 1: The illustrative embodiments let s_A = Σ_{1≤i≤r} a_i, s_B = Σ_{1≤i≤r} b_i, and s_C = Σ_{1≤i≤r} c_i.
The illustrative embodiments define the sparsity of a fast matrix multiplication algorithm as s = max{s_A, s_B, s_C}. Ballard et al. consider sparsity in analyzing and improving the numerical stability of fast matrix multiplication algorithms.
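To make Definition 1 concrete, the following Python sketch tabulates a_i, b_i, and c_i for Strassen's algorithm; the counts are read directly off the standard Strassen expressions and are offered here as an illustration of the definition, not as values quoted from the source.

```python
# For Strassen's algorithm (r = 7, T = 2):
# a_i: distinct blocks of A in M_i, e.g. M1 = (A11 + A22)(B11 + B22) uses 2.
# b_i: distinct blocks of B in M_i.
# c_i: number of output-block expressions containing M_i, e.g. M1 is in C11 and C22.
a = [2, 2, 1, 1, 2, 2, 2]
b = [2, 1, 2, 2, 1, 2, 2]
c = [2, 2, 2, 2, 2, 1, 1]

s_A, s_B, s_C = sum(a), sum(b), sum(c)
print(s_A, s_B, s_C)                              # 12 12 12
print("sparsity s =", max(s_A, s_B, s_C))         # 12
print("alpha =", 7 / s_A, "beta =", s_A / 2**2)   # alpha = 7/12, beta = 3.0
```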
Section 3.2 Basic TC0 Arithmetic Circuits.
The illustrative embodiments first develop the fundamental TC0 arithmetic circuits on which the results rely. The circuits are designed with neuromorphic implementation in mind, and the illustrative embodiments favor simple constructions over those that offer the least depth or gate count.
The first circuit follows from a classical technique to compute symmetric functions in TC0 by Muroga from 1959. It is also a special case of a more general result by Siu et al. The illustrative embodiments include a proof to demonstrate the simplicity of the construction.
Lemma 2. Let s = Σ_i w_i x_i ∈ [0, 2^l] be an integer-weighted sum of bits, x_i ∈ {0,1}. The kth most significant bit of s can be computed by a depth-2 threshold circuit using 2^k + 1 gates. Proof. The kth most significant bit of s is 1 precisely when s lies in one of the intervals [i·2^{l−k}, (i+1)·2^{l−k}) for some odd integer 1 ≤ i ≤ 2^k. The first layer of the circuit computes the functions y_i := bool(s ≥ i·2^{l−k}) for 1 ≤ i ≤ 2^k. The output of the circuit is bool(Σ_{i odd}(y_i − y_{i+1}) ≥ 1), since y_i − y_{i+1} is 1 if s ∈ [i·2^{l−k}, (i+1)·2^{l−k}) and 0 otherwise.
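A brute-force simulation of the Lemma 2 circuit (plain Python, illustrative only): the first layer realizes the 2^k comparison gates y_i, and the single output gate checks whether s falls in an interval with odd index.

```python
def kth_most_significant_bit(s, l, k):
    """Depth-2 circuit of Lemma 2: 2**k first-layer gates plus one output gate.
    s is an integer in [0, 2**l); returns the kth most significant of its l bits."""
    step = 2 ** (l - k)
    y = {i: (1 if s >= i * step else 0) for i in range(1, 2 ** k + 1)}  # layer 1
    # Output gate: fires iff s lies in [i*step, (i+1)*step) for some odd i.
    acc = sum(y[i] - y[i + 1] for i in range(1, 2 ** k, 2))
    return 1 if acc >= 1 else 0

s, l = 0b1011, 4
print([kth_most_significant_bit(s, l, k) for k in range(1, l + 1)])  # [1, 0, 1, 1]
```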
The illustrative embodiments build upon the above to obtain the primary addition circuit. The result below may be viewed as a slightly stronger version of Siu et al.'s Lemma 2 for the depth 2 case.
Corollary 3. Given n nonnegative integers, each with b bits, their sum can be computed by a depth-2 threshold circuit using O(bn) gates.
Proof. For convenience, the illustrative embodiments define log2(m) as the least integer l such that m < 2^l. Let s be the sum of the n integers z_1, . . . , z_n; s requires at most log2(n(2^b − 1)) ≤ log2(n) + b bits. First the illustrative embodiments compute the jth (least significant) bit of s, for 1 ≤ j ≤ b. Let s_j = Σ_i z̃_i, where z̃_i is obtained from z_i by ignoring all but the least significant j bits. Note that s_j requires at most log2(n) + j bits, and the jth bits of s and s_j are equal. The illustrative embodiments can compute this bit using 2n+1 gates, applying Lemma 2 on s_j with k = log2(n) + 1. Thus b(2n+1) gates suffice to compute the b least significant bits of s.
The remaining log2(n) most significant bits of s may be computed using O(n log n) gates by appealing to Lemma 2 for each bit. This is improved to O(n) gates by observing that the functions y_i computed for k = log2(n) in the proof of Lemma 2 include those required for all of the most significant log2(n) bits of s. Thus the illustrative embodiments need only the 2^{log2(n)} = O(n) shared first-layer gates, plus one output gate per bit, to compute these remaining bits, for a total of O(bn) gates overall.
Lemma 4. (Theorem 2.9, Siu and Roychowdhury [19]) Given n numbers, each with n bits, all the bits of their n^2-bit product can be computed by a depth-4 threshold circuit with a number of gates that is polynomial in n.
Although the circuits corresponding to the above Lemma are more involved than the others required, the illustrative embodiments will only be multiplying numbers with O(log(N)) bits and can look to simpler alternatives for practical implementation.
The “numbers” in the above Lemmas refer to nonnegative integers, however the results above can be extended to handle negative integers, with a constant-factor overhead in gate and wire count. In particular, the illustrative embodiments will need to take weighted sums of numbers with weights in {−1, 1}. The illustrative embodiments will represent each integer x as x = x^+ − x^−, where x^+ and x^− are each nonnegative, and at most one of the two is nonzero. Other representations are possible, but the illustrative embodiments select this one as it makes for a simpler presentation and implementation. Observe that Lemma 2 allows for negative weights w_i in the sum s; if s<0, then the circuit will output 0 for each bit. The illustrative embodiments can take two copies of the circuit in parallel, feeding x^+ to one and x^− to the other. At most one of the two circuits will output a nonzero answer, hence the illustrative embodiments may obtain s^+ and s^− for each bit s_i of s. This extends to Corollary 3 as well. Computing the product of such values will also incur extra overhead, as xy = (x^+ − x^−)(y^+ − y^−) = x^+y^+ − x^+y^− − x^−y^+ + x^−y^− will require additional multiplications, however this is a constant-factor overhead. For the sake of exposition, the illustrative embodiments proceed as if only computing positive quantities takes place.
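A small sketch of the signed representation described above (plain Python, illustrative only): each integer is carried as a pair (x_plus, x_minus) of nonnegative values with at most one nonzero component, and a product expands into the four nonnegative sub-products.

```python
def to_pair(x):
    """Represent integer x as (x_plus, x_minus): both nonnegative, at most one nonzero."""
    return (x, 0) if x >= 0 else (0, -x)

def pair_multiply(x_pair, y_pair):
    """Multiply pair-represented integers, mirroring
    xy = x+y+ - x+y- - x-y+ + x-y- using only nonnegative sub-products."""
    xp, xm = x_pair
    yp, ym = y_pair
    positive = xp * yp + xm * ym
    negative = xp * ym + xm * yp
    return to_pair(positive - negative)

print(pair_multiply(to_pair(-3), to_pair(5)))  # (0, 15), representing -15
```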
Section 3.3: A Subcubic TC0 Circuit for Matrix Multiplication.
The circuits for matrix multiplication implement a given conventional fast matrix multiplication algorithm in both a depth-efficient and gate-efficient manner. The illustrative embodiments define trees TA and TB for the input matrices A and B, respectively, based on the recursive or divide-and-conquer structure of the fast matrix multiplication algorithm. The nodes in TA represent weighted sums of blocks of A and likewise for TB. The root of TA represents the matrix A, while the leaves represent weighted sums of its entries. See
In a conventional PRAM implementation of a fast matrix multiplication algorithm, all the matrices at each level of TA and TB are computed, and the results are reused. Since there are O(log N) levels, the illustrative embodiments cannot hope to compute all the matrices at each level in a constant depth circuit, however the illustrative embodiments show that one may compute a constant number of levels of TA and TB in a way that allows use of a number of gates that is arbitrarily close to the total work performed by the fast matrix multiplication algorithm.
The illustrative embodiments assume, as in Section 3.1, that a fast matrix multiplication algorithm multiplies two T×T matrices using r multiplications. The illustrative embodiments describe an improved TC0 circuit for computing the values at the leaves of TA. The results extend naturally to computing the leaves of TB. Level h of TA contains r^h nodes, each corresponding to an N/T^h×N/T^h matrix. Moreover, each entry of each matrix at level h is a {−1,1}-weighted sum of at most T^{2h} entries of the root matrix, A. Hence if each entry of the integer matrix requires at most b bits, the number of bits required for each entry of a matrix at level h is at most ⌈log2(2^b·T^{2h})⌉ = b + ⌈2h·log2 T⌉. (2)
For the main results, the illustrative embodiments assume b=O(log N) bits.
The illustrative embodiments give a sub-cubic TC0 circuit for computing C=AB, however the illustrative embodiments first illustrate the main ideas by showing how to compute trace(A^3). As mentioned in Section 3.1, this allows for counting triangles in a graph. The bulk of the work lies in showing how to compute the N^{log_T r} weighted sums of entries of A corresponding to the leaves of TA (and, analogously, the leaves of TB).
The illustrative embodiments select t levels, 0 = h_0 < h_1 < h_2 < . . . < h_t, and the TC0 circuit computes all of the matrices at these t levels of TA. The illustrative embodiments must compute the leaves of TA, hence h_t = log_T N. The benefit of computing level h_i is that each entry of each matrix at level h_{i+1} is then a {−1,1}-weighted sum of at most T^{2(h_{i+1}−h_i)} entries of matrices at level h_i, rather than of up to T^{2h_{i+1}} entries of A.
The results rely on parameters associated with the fast matrix multiplication algorithm. Recall s_A from Definition 1. The illustrative embodiments define α = r/s_A and β = s_A/T^2, and assert that 0<α≤1 and β≥1 (for Strassen's algorithm, α=7/12 and β=3).
Lemma 5. For 1≤i≤t, if the matrices at level h_{i−1} of TA have been computed, then the matrices at level h_i can be computed in depth 2 using O((b + h_{i−1})·α^{h_{i−1}}·β^{h_i}·N^2) gates.
Proof. The r^{h_i} nodes at level h_i of TA each correspond to an N/T^{h_i}×N/T^{h_i} matrix whose entries are {−1,1}-weighted sums of entries of matrices at level h_{i−1}; naïvely, each such matrix is a weighted sum of at most T^{2(h_i−h_{i−1})} blocks of matrices at level h_{i−1}.
The illustrative embodiments seek a better bound on the number of such blocks that must be summed to obtain the matrix associated with u. Let size(u) represent this quantity, and let root(u) be the node at level h_{i−1} on the path from u to the root of TA. Recall that each edge of TA corresponds to one of the fast matrix multiplication expressions M_i, and that a_i is the number of distinct blocks of A that appear in M_i (defined in Section 3.1). The quantity size(u) is the product of the a_i associated with the edges on the path from u to root(u). Summing size(u) over the nodes u at level h_i that share a common root(u) gives Σ_{u: root(u)=v} size(u) = (a_1 + a_2 + · · · + a_r)^{h_i−h_{i−1}} = s_A^{h_i−h_{i−1}}, (3) where the first equality follows from the multinomial theorem. The illustrative embodiments now bound the number of gates required to compute the matrices at level h_i. Since the illustrative embodiments assume the matrices at level h_{i−1} have been computed, by Corollary 3, each entry of the matrix associated with node u at level h_i can be computed using O((b + h_{i−1})·size(u)) gates in depth 2. The illustrative embodiments charge the gate count for u to root(u), and by equations (3) and (2), the number of gates charged to each node at level h_{i−1} is O((b + h_{i−1})·s_A^{h_i−h_{i−1}}·(N/T^{h_i})^2). Summing over the r^{h_{i−1}} nodes at level h_{i−1} gives the stated bound of O((b + h_{i−1})·α^{h_{i−1}}·β^{h_i}·N^2) gates.
Next, the illustrative embodiments show how to set the h_i so that the number of gates required at each level is approximately balanced. This will yield a total gate count that is, at worst, within a factor of t of the gate count for an optimal setting of the h_i. The illustrative embodiments will need to assume that the number of multiplications required by the fast T×T matrix multiplication algorithm is greater than T^2. The results, as stated and proven below, do not hold for an optimal fast matrix multiplication algorithm in which the number of multiplications r = T^2. The illustrative embodiments set γ = log_β(1/α). Note that 0<γ<1 since r>T^2 is equivalent to αβ>1 (for Strassen's algorithm, γ≈0.491).
Lemma 6. Let h_i = ⌈(1−γ^i)ρ⌉, for some ρ>0. Then all the matrices at levels h_1, . . . , h_t of TA can be computed in depth 2t using O(t·(αβ)^ρ·(b + log N)·N^2) gates.
Proof. The illustrative embodiments have h_i ≤ log_T N for all 0≤i≤t since the latter is the height of TA. By Lemma 5, level h_i can be computed in depth 2 using O((b + log N)·α^{h_{i−1}}·β^{h_i}·N^2) gates. Since h_{i−1} ≥ (1−γ^{i−1})ρ, h_i ≤ (1−γ^i)ρ + 1, α ≤ 1, and β ≥ 1, it follows that α^{h_{i−1}}·β^{h_i} ≤ β·(αβ)^ρ·(α^{−γ^{i−1}}·β^{−γ^i})^ρ = β·(αβ)^ρ, because α^{−γ^{i−1}} = β^{γ^i} by the definition of γ. Since β is a constant, summing over the t levels gives a total of O(t·(αβ)^ρ·(b + log N)·N^2) gates,
from which the claim follows.
The above Lemma establishes a tradeoff in the following sense. The value ρ impacts the total number of gates, however the illustrative embodiments require that h_t = log_T N, which imposes constraints on t and, consequently, the depth of the circuit. The larger ρ is, the smaller t needs to be in order for h_t = log_T N.
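For concreteness, a short Python calculation (illustrative only, using the Strassen values r=7, T=2, s_A=12 assumed above) of α, β, γ and the levels h_i = ⌈(1−γ^i)ρ⌉ from Lemma 6, with ρ = log_T N as used in Theorem 7.

```python
import math

r, T, s_A = 7, 2, 12                 # Strassen's algorithm
alpha = r / s_A                      # 7/12
beta = s_A / T**2                    # 3.0
gamma = math.log(1 / alpha, beta)    # ~0.491

N = 2**20
rho = math.log2(N)                   # rho = log_T N = 20 for T = 2
levels = [math.ceil((1 - gamma**i) * rho) for i in range(1, 8)]
print(round(gamma, 3), levels)       # 0.491 [11, 16, 18, 19, 20, 20, 20]
```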
The illustrative embodiments note that the natural strategy of taking h_i = i·log_T N/t yields a weaker result than the one obtained here. A comparable weaker result can be obtained by directly computing the leaves of TA without computing intermediate levels. However, this results in sums of larger magnitude, and the more involved gate-efficient addition circuits of Siu et al. are needed. A disadvantage of using the latter circuits is that they require a significantly larger number of wires.
The illustrative embodiments now establish the main theorems by better quantifying the tradeoff between ρ and t. For these theorems the illustrative embodiments assume that a fast matrix multiplication algorithm is given and take ω = log_T r.
Theorem 7. The illustrative embodiments are given an integer τ and an N×N integer matrix A with entries of size O(log N) bits. There is a threshold circuit of depth O(log log N) that determines whether trace(A^3) ≥ τ using Õ(N^ω) gates.
Proof. The illustrative embodiments appeal to Lemma 6, setting ρ = log_T N. The gate bound follows from (αβ)^ρ = (r/T^2)^{log_T N} = N^{log_T r − 2} = N^{ω−2}, so that Lemma 6 yields Õ(N^ω) gates. For the depth bound, the illustrative embodiments take t = ⌈log_{1/γ}(log_T N)⌉ + 1 = O(log log N), so that γ^t·log_T N < 1.
Thus h_t = ⌈(1−γ^t)·log_T N⌉ = log_T N, as desired.
This shows that the illustrative embodiments can compute the values corresponding to the leaves of TA and TB in the stated gate and depth bounds. One may see that each entry C_{ij} is a {−1,1}-weighted sum of products, Σ_{k∈I_{ij}} ±P_k, where each P_k is the product of a leaf value of TA with the corresponding leaf value of TB, and I_{ij} indexes the products that appear in the expression for C_{ij}. Substituting into Σ_{i,j∈V: i<j} A_{ij}C_{ij} and regrouping by the P_k shows that the triangle-counting sum is a weighted sum of the products P_k, with coefficients that are {−1,1}-weighted sums of entries of A.
Thus for each product, P_k, the illustrative embodiments want to multiply it with a {−1,1}-weighted sum over entries of A. The illustrative embodiments may compute these weighted sums independently and in parallel with those for A and B using the same techniques. Thus the illustrative embodiments seek to compute N^ω products of three O(log N)-bit numbers, and appeal to Lemma 4 to accomplish this in depth 4 using a total of Õ(N^ω) gates.
The illustrative embodiments now prove the main theorem by exhibiting a more refined tradeoff between ρ and t.
Theorem 8. The illustrative embodiments are given an integer τ, an N×N integer matrix A with entries of size O(log N) bits, and a positive integer d. There is a threshold circuit of depth at most 2d+5 that determines whether trace(A^3) ≥ τ using Õ(d·N^{ω+c·γ^d}) gates, where c is a constant depending on the fast matrix multiplication algorithm.
Proof. As for the previous theorem, the illustrative embodiments appeal to Lemma 6, this time setting ρ = log_T N + ε·log_{αβ} N, for a constant ε>0 whose value is given below. The illustrative embodiments have (αβ)^ρ = (r/T^2)^{log_T N}·(αβ)^{ε·log_{αβ} N} = N^{ω−2}·N^ε, so Lemma 6 yields Õ(t·N^{ω+ε}) gates.
The illustrative embodiments set ε = γ^d·log_T(αβ)/(1−γ) > γ^d·log_T(αβ)/(1−γ^d). This implies (1−γ^d)·ρ > log_T N; hence the illustrative embodiments may take t ≤ d in Lemma 6 in order to have h_t = log_T N. The theorem follows from the argument used in the proof of Theorem 7 and taking c = log_T(αβ)/(1−γ) (for Strassen's algorithm, c≈1.585).
The illustrative embodiments describe how to compute the entries of C for the more general case of computing the product AB. The illustrative embodiments define a tree TAB with the same structure as TA and TB. Each node of TAB represents the product of the matrices of the corresponding nodes of TA and TB. Hence the root of TAB represents the matrix C=AB, and the leaves represent the N^{log_T r} scalar products of corresponding leaves of TA and TB.
Lemma 9. For 1≤i≤t, if the matrices at level h_i of TAB have been computed, then the matrices at level h_{i−1} can be computed in depth 2 using O((b + h_{i−1})·α_C^{h_{i−1}}·β_C^{h_i}·N^2) gates, where α_C = r/s_C and β_C = s_C/T^2.
Proof. The proof uses a similar analysis to that of Lemma 5. The illustrative embodiments will need new parameters derived from the fast matrix multiplication algorithm. For 1≤j≤T^2, use j to index the T^2 expressions for blocks of C, and define ć_j as the number of M_i that appear in the expression corresponding to j. For Strassen's algorithm, the values ć_j for the four blocks of C are 4, 2, 2, and 4, respectively, and Σ_j ć_j = s_C = 12.
The illustrative embodiments assume the matrix products at level h_i of TAB have been computed and compute a node u at level h_{i−1}. The matrix at node u is composed of T^{2(h_i−h_{i−1})} blocks, each of which is a {−1,1}-weighted sum of matrices at level h_i that are descendants of u; the number of terms in such a sum is a product of the ć_j along the corresponding path, and summing these counts over all blocks of u gives at most s_C^{h_i−h_{i−1}} terms, as in the proof of Lemma 5. Applying Corollary 3 to each entry and charging the gates to u then yields the stated bound.
The illustrative embodiments obtain the final circuit by using the above Lemma with arguments analogous to those of Theorems 7 and 8.
Section 3.4: Open Problems.
The main open problem is whether the illustrative embodiments can do matrix multiplication with O(N^ω) gates in constant depth. Theorem 7 shows this can be done in O(log log N) depth. Another open question is lower bounds: What is the minimum depth of a threshold circuit for computing matrix products using O(N^{3−ε}) gates? Can one show that a constant depth threshold circuit using O(N^ω) gates yields an O(log N) PRAM algorithm with O(N^ω) work?
The circuits are L-uniform. Can a stronger uniformity condition be imposed?
One advantage of neural networks is their low energy relative to CMOS-based electronics. One possible energy model for threshold gates is to charge a gate only if it fires. That is, charge a gate one unit of energy for sending a signal if and only if the weighted sum of the inputs exceeds the threshold. What is the energy complexity of these kinds of matrix multiplication circuits? This will depend on the input class.
Method 400 may be characterized as a method of increasing an efficiency at which a plurality of threshold gates arranged as neuromorphic hardware is able to perform a linear algebraic calculation having a dominant size of N. Method 400 may include using the plurality of threshold gates to perform the linear algebraic calculation in a manner that is simultaneously efficient and at a near constant depth, wherein “efficient” is defined as a calculation algorithm that uses fewer of the plurality of threshold gates than a naïve algorithm, wherein the naïve algorithm is a straightforward algorithm for solving the linear algebraic calculation, wherein “constant depth” is defined as an algorithm that has an execution time that is independent of a size of an input to the linear algebraic calculation, and wherein the near constant depth comprises a computing depth equal to or between O(log(log(N))) and the constant depth (operation 402). In one illustrative embodiment, method 400 may terminate thereafter.
Method 400 may be varied. For example, in one illustrative embodiment, the linear algebraic calculation comprises matrix multiplication of two square matrices of size N×N. In another illustrative embodiment, the calculation algorithm has a first number of operations of O(N^{2+δ}) compared to the straightforward algorithm having a second number of operations of O(N^3), wherein δ comprises a number that is greater than or equal to zero but less than one.
In this case, method 400 may include optional steps. For example, method 400 optionally may also include converting an initial depth of the linear algebraic calculation, wherein the initial depth is log_2 N, to the near constant depth (operation 404). In addition, when the near constant depth is the constant depth, then converting includes setting the constant depth to a value of at most 2d+5 that determines whether trace(A^3) ≥ τ using Õ(d·N^{ω+c·γ^d}) gates, wherein d comprises a positive integer.
In a different illustrative embodiment, the linear algebraic calculation comprises a matrix inversion. In a still different illustrative embodiment, the linear algebraic calculation comprises multiplying at least two matrices in order to count triangles in a graph G.
Still further variations are possible, such as those described above with respect to sections 2 through 3. Other variations are possible, including more, fewer, or different operations. Thus, the claimed inventions are not necessarily limited to the illustrative embodiments described with respect to
Neuromorphic computer 500 may include a plurality of threshold gates or spiking neurons 504. Plurality of threshold gates or spiking neurons 504 may be configured to compute a specific linear algebraic calculation having a dominant size of N in a manner that is simultaneously efficient and at a near constant depth, wherein “efficient” is defined as a calculation algorithm that uses fewer of the plurality of threshold gates than a naïve algorithm, wherein the naïve algorithm is a straightforward algorithm for solving the specific linear algebraic calculation, wherein “constant depth” is defined as an algorithm that has an execution time that is independent of a size of an input to the specific linear algebraic calculation, and wherein the near constant depth comprises a computing depth equal to or between O(log(log(N))) and the constant depth.
Neuromorphic computer 500 may also include other components, such as power source 502, bus 506, and memory 508. Neuromorphic computer 500 may also be in communication with von Neumann computer 510 (a more typical computer).
Neuromorphic computer 500 may be varied. For example, the specific linear algebraic calculation comprises a matrix multiplication of two square matrices of size N×N. In this case, the calculation algorithm has a first number of operations of O(N^{2+δ}) compared to the straightforward algorithm having a second number of operations of O(N^3), wherein δ comprises a number that is greater than or equal to zero but less than one.
In a related illustrative embodiment, the plurality of threshold gates or spiking neurons is further configured to convert an initial depth of the specific linear algebraic calculation to the near constant depth, wherein the initial depth is log_2 N. Furthermore, the near constant depth may be the constant depth, and wherein, in being configured to convert, the plurality of threshold gates or spiking neurons are further configured to set the constant depth to a value of at most 2d+5 that determines whether trace(A^3) ≥ τ using Õ(d·N^{ω+c·γ^d}) gates, wherein d comprises a positive integer.
In another illustrative embodiment, a sub-plurality of the plurality of threshold gates or spiking neurons are dedicated to communicate non-binary numbers that require increasing precision to define during subsequent stages of the calculation algorithm. In still another illustrative embodiment, the specific linear algebraic calculation comprises a matrix inversion. In yet another illustrative embodiment, the specific linear algebraic calculation comprises multiplying at least two matrices in order to count triangles in a graph G.
Still further variations are possible, such as those described above with respect to sections 2 through 3. Other variations are possible, including more, fewer, or different components. Thus, the claimed inventions are not necessarily limited to the illustrative embodiments described with respect to
Method 600 includes manufacturing a plurality of threshold gates or spiking neurons (operation 602). Method 600 may also include arranging the plurality of threshold gates or spiking neurons to compute the linear algebraic calculation having a dominant size N in a manner that is simultaneously efficient and at a near constant depth, wherein “efficient” is defined as a calculation algorithm that uses fewer of the plurality of threshold gates than a naïve algorithm, wherein the naïve algorithm is a straightforward algorithm for solving the linear algebraic calculation, wherein “constant depth” is defined as an algorithm that has an execution time that is independent of a size of an input to the linear algebraic calculation, and wherein the near constant depth comprises a computing depth equal to or between O(log(log(N))) and the constant depth (operation 604). In one illustrative embodiment, the method may terminate thereafter.
Method 600 may be varied. For example, the specific linear algebraic calculation comprises matrix multiplication of two square matrices of size N×N.
Method 600 may include additional operations. For example, optionally, method 600 may include further arranging the plurality of threshold gates or spiking neurons to convert an initial depth of the specific linear algebraic calculation to the near constant depth, wherein the initial depth is log2N (operation 606). In still another illustrative embodiment, method 600 may, in addition to or in place of operation 606, also include further arranging the plurality of threshold gates or spiking neurons to dedicate a sub-plurality of the plurality of threshold gates or spiking neurons to communicate non-binary numbers that require increasing precision to define during subsequent stages of the calculation algorithm (operation 608). In one illustrative embodiment, the method may terminate thereafter.
Still further variations are possible, such as those described above with respect to sections 2 through 3. Other variations are possible, including more, fewer, or different operations. Thus, the claimed inventions are not necessarily limited to the illustrative embodiments described with respect to
Turning now to
Processor unit 704 serves to execute instructions for software that may be loaded into memory 706. Processor unit 704 may be a number of processors, a multi-processor core, or some other type of processor, depending on the particular implementation. A number, as used herein with reference to an item, means one or more items. Further, processor unit 704 may be implemented using a number of heterogeneous processor systems in which a main processor is present with secondary processors on a single chip. As another illustrative example, processor unit 704 may be a symmetric multi-processor system containing multiple processors of the same type.
Memory 706 and persistent storage 708 are examples of storage devices 716. A storage device is any piece of hardware that is capable of storing information, such as, for example, without limitation, data, program code in functional form, and/or other suitable information, on a temporary basis and/or a permanent basis. Storage devices 716 may also be referred to as computer readable storage devices in these examples. Memory 706, in these examples, may be, for example, a random access memory or any other suitable volatile or non-volatile storage device. Persistent storage 708 may take various forms, depending on the particular implementation.
For example, persistent storage 708 may contain one or more components or devices. For example, persistent storage 708 may be a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above. The media used by persistent storage 708 also may be removable. For example, a removable hard drive may be used for persistent storage 708.
Communications unit 710, in these examples, provides for communications with other data processing systems or devices. In these examples, communications unit 710 is a network interface card. Communications unit 710 may provide communications through the use of either or both physical and wireless communications links.
Input/output (I/O) unit 712 allows for input and output of data with other devices that may be connected to data processing system 700. For example, input/output (I/O) unit 712 may provide a connection for user input through a keyboard, a mouse, and/or some other suitable input device. Further, input/output (I/O) unit 712 may send output to a printer. Display 714 provides a mechanism to display information to a user.
Instructions for the operating system, applications, and/or programs may be located in storage devices 716, which are in communication with processor unit 704 through communications fabric 702. In these illustrative examples, the instructions are in a functional form on persistent storage 708. These instructions may be loaded into memory 706 for execution by processor unit 704. The processes of the different embodiments may be performed by processor unit 704 using computer implemented instructions, which may be located in a memory, such as memory 706.
These instructions are referred to as program code, computer usable program code, or computer readable program code that may be read and executed by a processor in processor unit 704. The program code in the different embodiments may be embodied on different physical or computer readable storage media, such as memory 706 or persistent storage 708.
Program code 718 is located in a functional form on computer readable media 720 that is selectively removable and may be loaded onto or transferred to data processing system 700 for execution by processor unit 704. Program code 718 and computer readable media 720 form computer program product 722 in these examples. In one example, computer readable media 720 may be computer readable storage media 724 or computer readable signal media 726. Computer readable storage media 724 may include, for example, an optical or magnetic disk that is inserted or placed into a drive or other device that is part of persistent storage 708 for transfer onto a storage device, such as a hard drive, that is part of persistent storage 708. Computer readable storage media 724 also may take the form of a persistent storage, such as a hard drive, a thumb drive, or a flash memory, that is connected to data processing system 700. In some instances, computer readable storage media 724 may not be removable from data processing system 700.
Alternatively, program code 718 may be transferred to data processing system 700 using computer readable signal media 726. Computer readable signal media 726 may be, for example, a propagated data signal containing program code 718. For example, computer readable signal media 726 may be an electromagnetic signal, an optical signal, and/or any other suitable type of signal. These signals may be transmitted over communications links, such as wireless communications links, optical fiber cable, coaxial cable, a wire, and/or any other suitable type of communications link. In other words, the communications link and/or the connection may be physical or wireless in the illustrative examples.
In some illustrative embodiments, program code 718 may be downloaded over a network to persistent storage 708 from another device or data processing system through computer readable signal media 726 for use within data processing system 700. For instance, program code stored in a computer readable storage medium in a server data processing system may be downloaded over a network from the server to data processing system 700. The data processing system providing program code 718 may be a server computer, a client computer, or some other device capable of storing and transmitting program code 718.
The different components illustrated for data processing system 700 are not meant to provide architectural limitations to the manner in which different embodiments may be implemented. The different illustrative embodiments may be implemented in a data processing system including components in addition to or in place of those illustrated for data processing system 700. Other components shown may be varied from the illustrative examples.
In another illustrative example, processor unit 704 may take the form of a hardware unit that has circuits that are manufactured or configured for a particular use. This type of hardware may perform operations without needing program code to be loaded into a memory from a storage device to be configured to perform the operations.
For example, when processor unit 704 takes the form of a hardware unit, processor unit 704 may be a circuit system, an application specific integrated circuit (ASIC), a programmable logic device, or some other suitable type of hardware configured to perform a number of operations. With a programmable logic device, the device is configured to perform the number of operations. The device may be reconfigured at a later time or may be permanently configured to perform the number of operations. Examples of programmable logic devices include, for example, a programmable logic array, programmable array logic, a field programmable logic array, a field programmable gate array, and other suitable hardware devices. With this type of implementation, program code 718 may be omitted because the processes for the different embodiments are implemented in a hardware unit.
In still another illustrative example, processor unit 704 may be implemented using a combination of processors found in computers and hardware units. Processor unit 704 may have a number of hardware units and a number of processors that are configured to run program code 718. With this depicted example, some of the processes may be implemented in the number of hardware units, while other processes may be implemented in the number of processors.
As another example, a storage device in data processing system 700 is any hardware apparatus that may store data. Memory 706, persistent storage 708, and computer readable media 720 are examples of storage devices in a tangible form.
In another example, a bus system may be used to implement communications fabric 702 and may be comprised of one or more buses, such as a system bus or an input/output bus. Of course, the bus system may be implemented using any suitable type of architecture that provides for a transfer of data between different components or devices attached to the bus system. Additionally, a communications unit may include one or more devices used to transmit and receive data, such as a modem or a network adapter. Further, a memory may be, for example, memory 706, or a cache, such as found in an interface and memory controller hub that may be present in communications fabric 702.
The different illustrative embodiments can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. Some embodiments are implemented in software, which includes but is not limited to forms such as, for example, firmware, resident software, and microcode.
Furthermore, the different embodiments can take the form of a computer program product accessible from a computer usable or computer readable medium providing program code for use by or in connection with a computer or any device or system that executes instructions. For the purposes of this disclosure, a computer usable or computer readable medium can generally be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The computer usable or computer readable medium can be, for example, without limitation an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium. Non-limiting examples of a computer readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, and an optical disk. Optical disks may include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W), and DVD.
Further, a computer usable or computer readable medium may contain or store a computer readable or computer usable program code such that when the computer readable or computer usable program code is executed on a computer, the execution of this computer readable or computer usable program code causes the computer to transmit another computer readable or computer usable program code over a communications link. This communications link may use a medium that is, for example without limitation, physical or wireless.
A data processing system suitable for storing and/or executing computer readable or computer usable program code will include one or more processors coupled directly or indirectly to memory elements through a communications fabric, such as a system bus. The memory elements may include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some computer readable or computer usable program code to reduce the number of times code may be retrieved from bulk storage during execution of the code.
Input/output devices can be coupled to the system either directly or through intervening Input/output controllers. These devices may include, for example, without limitation, keyboards, touch screen displays, and pointing devices.
Different communications adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems and network adapters are just a few non-limiting examples of the currently available types of communications adapters.
The description of the different illustrative embodiments has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. Further, different illustrative embodiments may provide different features as compared to other illustrative embodiments. The embodiment or embodiments selected are chosen and described in order to best explain the principles of the embodiments, the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.
This invention was made with United States Government support under Contract No. DE-NA0003525 between National Technology & Engineering Solutions of Sandia, LLC and the United States Department of Energy. The United States Government has certain rights in this invention.
Patent Citations (Number | Name | Date | Kind):
20040071363 | Kouri | Apr. 2004 | A1
20050222825 | Feldmann | Oct. 2005 | A1
20050234686 | Cheng | Oct. 2005 | A1
Other Publications:
Bläser, "Fast Matrix Multiplication," Theory of Computing Library, Graduate Surveys, 5 (Dec. 2013), pp. 1-60.
Esser et al., "Backpropagation for Energy-Efficient Neuromorphic Computing," Advances in Neural Information Processing Systems, © 2015, pp. 1117-1125.
Furst et al., "Parity, Circuits, and the Polynomial-Time Hierarchy," Mathematical Systems Theory, 17 (Aug. 1983), pp. 13-27.
Indiveri et al., "Neuromorphic silicon neuron circuits," Frontiers in Neuroscience, vol. 5, Article 73 (May 2011), pp. 1-23.
Kane et al., "Super-Linear Gate and Super-Quadratic Wire Lower Bounds for Depth-Two and Depth-Three Threshold Circuits," arXiv:1511.07860v1 [cs.CC], Nov. 2015, 20 pages.
Khan et al., "SpiNNaker: Mapping Neural Networks onto a Massively-Parallel Chip Multiprocessor," Proceedings of the International Joint Conference on Neural Networks, Jun. 2008, 8 pages.
Merolla et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, Issue 6197 (Aug. 2014), pp. 668-673.
Orman et al., "An Empirical Study of the Relation Between Community Structure and Transitivity," Studies in Computational Intelligence, vol. 424, © 2014, pp. 99-110.
Seshadhri et al., "Community structure and scale-free collections of Erdos-Renyi graphs," arXiv:1112.3644v1 [cs.SI], Dec. 15, 2011; Physical Review E, vol. 85, © 2012, 10 pages.
Sima et al., "General-Purpose Computation with Neural Networks: A Survey of Complexity Theoretic Results," Neural Computation, vol. 15 (Dec. 2003), pp. 2727-2778.
Williams, V.V., "Multiplying matrices in O(n^2.373) time," Stanford University, Jul. 1, 2014, 73 pages.
Anonymous Author(s), "Constant Depth and sub-cubic Size Threshold Circuits for Counting Triangles in Graphs," submitted to 29th Conference on Neural Information Processing Systems (NIPS 2016).
Siu et al., "On Optimal Depth Threshold Circuits for Multiplication and Related Problems," SIAM Journal on Discrete Mathematics, vol. 7, No. 2 (May 1994), pp. 284-292.
Siu et al., "Depth-Size Tradeoffs for Neural Computation," IEEE Transactions on Computers, vol. 40, No. 12 (Dec. 1991), pp. 1402-1412.
Strassen, "Gaussian Elimination is not Optimal," Numerische Mathematik, 13 (Aug. 1969), pp. 354-356.
Yao, A.C.-C., "Separating the Polynomial-Time Hierarchy by Oracles" (Preliminary Version), 26th Annual Symposium on Foundations of Computer Science (Oct. 1985), pp. 1-10.
McCulloch et al., "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, vol. 5 (1943), pp. 115-133.
Publication (Number | Date | Country):
20190079729 A1 | Mar. 2019 | US