Combinatorial Optimization problems are a particularly important class of computing problem and are prevalent across disciplines, from scheduling and logistics to the analysis of physical systems and efficient routing. Many of these problems belong to the NP-Hard and NP-Complete classes, for which no known polynomial-time solution exists. This has led to interest in novel algorithms, architectures, and systems to solve these problems. The Ising Model problem is an example of this type of problem, with foundations in statistical physics, and because many combinatorial optimization problems can be transformed into it, the Ising Model has emerged as an efficient way of mapping these problems onto various physical accelerators.
The Boltzmann Machine is a stochastic neural network over binary variables that maps directly onto the Ising model. Because of this, the Boltzmann Machine has received attention as a means of solving Combinatorial Optimization problems. However, the standard Gibbs Sampling algorithm used with the Boltzmann Machine is computationally expensive due to its sequential nature and the many samples needed to reach convergence. The Restricted Boltzmann Machine (RBM) can address some of these problems by introducing a parallel sampling scheme through the removal of intra-layer connections. Therefore, Restricted Boltzmann Machines have found substantial interest as next-generation accelerators for computationally difficult problems.
A parallel architecture for combinatorial optimization is provided. In certain implementations, a parallel stochastic sampling scheme of a restricted Boltzmann machine is mapped onto a parallel processor such as provided by a field programmable gate array (FPGA) or an Application Specific Integrated Circuit (ASIC). Through the described multiplier-free design, a single parallel processor's resources of memory and compute can be used to accelerate combinatorial optimization problems.
A combinatorial optimization accelerator having a parallel architecture for combinatorial optimization includes a memory management system on a parallel processor coupled to a memory of the parallel processor, wherein the memory stores a weight matrix; a sampling engine on the parallel processor, the sampling engine coupled to receive weights of the weight matrix stored in the memory from the memory management system and perform as a restricted Boltzmann machine for a set of inputs using the received weights, wherein the sampling engine includes a dual architecture of a first circuit for updating visible states and a second circuit for updating hidden states; and a probability estimator that receives updated visible states and updated hidden states from the sampling engine.
In a further implementation, the combinatorial optimization accelerator can incorporate parallel tempering. A combinatorial optimization accelerator using parallel tempering includes a plurality of the sampling engines, a corresponding plurality of the probability estimators, and a swap controller. The sampling engines can receive weights of the weight matrix and an inverse temperature parameter. The swap controller calculates swap probabilities and performs swaps between two units, each formed of a sampling engine and corresponding probability estimator, once the samples for each unit have been updated. A memory management system can stream copies of the weight matrix to the multiple sampling engines.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
A parallel architecture for combinatorial optimization is provided. The described combinatorial optimization accelerator implementing a parallel architecture can be used in numerous applications. For example, the described accelerator can be used for circuit place and route algorithms that can be part of or used by electronic design automation tools. The max-cut problem and the Ising model may be used in generating automatic place and route results for design layouts of circuits and can assist in cases of 3D integration. As another example, the Ising model and described accelerator can be used for supply chain logistics and last-mile delivery, such as efficient routing of a fleet of delivery vehicles to pick up packages and to rapidly re-route packages and drivers.
In certain implementations, a parallel stochastic sampling scheme of a restricted Boltzmann machine is mapped onto a parallel processor such as provided by a field programmable gate array (FPGA) or an Application Specific Integrated Circuit (ASIC). Through the described multiplier-free design, a single parallel processor's resources of memory and compute can be used to accelerate combinatorial optimization problems.
The memory management system 110 is coupled to the memory 115 and configures how the memory 115 of the parallel processor system 100 is used. The memory 115 of the parallel processor system 100 can include dynamic random access memory (DRAM), static random access memory (SRAM), or a stacked random access memory (e.g., stacked synchronous dynamic random-access memory (SDRAM), high bandwidth memory (HBM)), and the like.
The chip (e.g., device of the parallel processor system 100) has a certain amount of memory 115 available (e.g., RAM). The memory 115 is used to store a weight matrix. The weight matrix may be symmetric or asymmetric. The memory management system 110 supports reading from and writing to the memory 115, for example, the writing and reading of rows and columns of the matrix. In such a case, the memory management system 110 can read out a row of weights and feed as many rows to the stochastic sampling engine 120 as possible in one clock cycle (the number of which depends on the particular implementation).
The memory management system 110 can be used to store the weight matrix used by the stochastic sampling engine 120 in a manner that maintains row and column awareness. That is, the memory management system 110 can track the row and column of a weight when streaming weights of the matrix to the stochastic sampling engine 120. This allows for streaming of rows, columns, diagonals, and/or sub-matrices.
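For illustration only, the following Python sketch models this row- and column-aware streaming in software; the generator name and interface are assumptions for illustration, not the memory management system's actual interface. Each streamed weight is tagged with its (row, column) position so that rows, columns, or (wrap-around) diagonals can be reconstructed downstream.

```python
import numpy as np

def stream_weights(W, order="rows"):
    """Yield (row, col, weight) tuples in row, column, or wrap-around diagonal order."""
    n_rows, n_cols = W.shape
    if order == "rows":
        for i in range(n_rows):
            for j in range(n_cols):
                yield i, j, W[i, j]
    elif order == "columns":
        for j in range(n_cols):
            for i in range(n_rows):
                yield i, j, W[i, j]
    elif order == "diagonals":
        for d in range(n_cols):
            for i in range(n_rows):
                j = (i + d) % n_cols  # wrap-around diagonal offset
                yield i, j, W[i, j]

# Example: stream a small matrix column by column.
W = np.arange(6, dtype=float).reshape(2, 3)
for i, j, w in stream_weights(W, order="columns"):
    print(i, j, w)
```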
The stochastic sampling engine 120 is coupled to receive weights of the weight matrix and may perform as a restricted Boltzmann machine (RBM) for a set of inputs with the assistance of the memory management system 110.
An RBM is an algorithm for a generative stochastic neural network, formed of two layers, that can be used to learn or describe a probability distribution over a set of inputs. The first layer of the RBM is the visible layer (e.g., the input layer, ν) and the second layer of the RBM is the hidden layer, h. The RBM encodes an exponential family probability distribution over binary variables p(ν, h), which can be represented as follows.
p(ν, h) = e^(−E(ν, h))/Z, with Z = Σ_(ν, h) e^(−E(ν, h)),
where the energy function E(ν, h) reflects the problem being solved and is simplified to reflect linear synaptic weights with two-body interactions.
For example, a bipartite graph for the RBM can be formed from a fully connected Ising model problem and the RBM energy function set to E(ν, h) = ν^T W h, where W is the bipartite graph's weight matrix. The weight matrix encodes the lowest-energy states of the original Ising Model problem as the highest-probability states in the RBM probability distribution.
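As a concrete illustration, the Python sketch below evaluates this bipartite energy E(ν, h) = ν^T W h and the corresponding unnormalized probability exp(−E(ν, h)) for small binary state vectors. It is a minimal software sketch, not part of the described hardware; the function names and the example weight matrix are assumptions for illustration.

```python
import numpy as np

def rbm_energy(v, h, W):
    """Bipartite RBM energy E(v, h) = v^T W h for binary state vectors v and h."""
    return v @ W @ h

def unnormalized_probability(v, h, W):
    """Unnormalized p(v, h) proportional to exp(-E(v, h)); the partition function Z is omitted."""
    return np.exp(-rbm_energy(v, h, W))

# Small illustrative example with arbitrarily chosen weights.
W = np.array([[ 1.0, -2.0],
              [-0.5,  0.5]])
v = np.array([1, 0])
h = np.array([0, 1])
print(rbm_energy(v, h, W), unnormalized_probability(v, h, W))
```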
The stochastic sampling engine 120, as part of the compute core, performs parallelized updates of the visible/hidden states using a Gibbs Sampling update. That is, Gibbs sampling can be performed on the RBM probability distribution to generate samples of the RBM probability distribution. Gibbs sampling is a type of Markov Chain Monte Carlo.
In detail, each neuron has a stochastic activation function where p(ν_i = 1 | h) = σ(w_i^T h + b_i) and σ(x) = (1 + e^(−x))^(−1), with w_i being the i-th row vector of the weight matrix and b_i being the bias associated with that neuron. The lack of intra-layer connections allows each neuron in a layer to be sampled in parallel, creating a massively parallel sampling scheme. Each layer is sampled and the result is passed to the other layer to get the next sample.
As mentioned above, this RBM algorithm and Gibbs sampling can be accelerated through implementation on the described parallel processor system 100. The circuitry of the stochastic sampling engine 120 for the visible updates is able to implement ν = σ(W h + b_ν), i.e., ν_i = σ(Σ_j w_ij h_j + b_i), and the circuitry for the hidden updates is able to implement h = σ(W^T ν + b_h), i.e., h_j = σ(Σ_i w_ij ν_i + b_j).
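To make the update scheme concrete, here is a minimal software sketch of the block Gibbs sampling updates described above. It is a reference model only, not the hardware circuitry; the function and variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_v, b_h, rng):
    """One block Gibbs sweep: sample all hidden units given v, then all visible units given h."""
    # Hidden update: h_j = sigma(sum_i w_ij * v_i + b_j), all hidden units sampled in parallel.
    p_h = sigmoid(W.T @ v + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(np.int8)
    # Visible update: v_i = sigma(sum_j w_ij * h_j + b_i), all visible units sampled in parallel.
    p_v = sigmoid(W @ h + b_v)
    v = (rng.random(p_v.shape) < p_v).astype(np.int8)
    return v, h

# Example usage on an arbitrary small problem.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b_v, b_h = np.zeros(4), np.zeros(3)
v = rng.integers(0, 2, size=4).astype(np.int8)
for _ in range(100):
    v, h = gibbs_step(v, W, b_v, b_h, rng)
```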
In operation, the memory management system 110 of the parallel processor interfaces with a host system 140 to receive the problem to be solved. The memory management system 110 streams rows (or columns or diagonals) of the weight matrix (such as the weight matrix that may be generated when transforming a fully connected Ising model problem to the bipartite graph structure in the RBM) to the sampling engine 120, utilizing as many memory blocks as are available in the memory 115 on the parallel processor system 100.
The stochastic sampling engine 120 performs parallelized updates of the visible/hidden states using the Gibbs Sampling update (e.g., as implemented by the visible-update and hidden-update circuitry described above).
The sampling engine 120 performs stochastic sampling based on the weight matrix. The weight matrix defines a probability distribution that is sampled stochastically, for example according to the restricted Boltzmann machine probability distribution. The sampling engine 120 effectively performs a matrix-vector multiplication, but since the visible states and the hidden states are binary values, each multiplication can be replaced by an AND gate or a digital MUX (or other gate). For non-symmetric matrices, the visible update for a neuron sums the elements of a row, while each hidden update sums the elements of a column. Thus, as the rows are streamed and accumulated one by one, the visible update for the current row accumulates across that row (e.g., adding up each element of the row), while each hidden update accumulates one element from each arriving row, such that each column sum is multiplexed through time.
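The following Python sketch is a software model of this row-streaming, multiplier-free accumulation; it is illustrative only and is not the gate-level design, and the function name is an assumption. Because the states are binary, each product w_ij·h_j reduces to selecting either w_ij or 0, which is what the AND gate or MUX performs in hardware.

```python
import numpy as np

def stream_rows_accumulate(W, v, h):
    """Stream W row by row, gating each weight with a binary state instead of multiplying.

    Returns the pre-activation sums for the visible updates (row sums gated by h)
    and for the hidden updates (column sums gated by v, accumulated over time).
    """
    n_visible, n_hidden = W.shape
    visible_acc = np.zeros(n_visible)
    hidden_acc = np.zeros(n_hidden)
    for i in range(n_visible):  # one row streamed per unit of time
        row = W[i]
        # Visible update for neuron i: sum the row, keeping w_ij only where h_j = 1.
        visible_acc[i] = np.sum(np.where(h == 1, row, 0.0))
        # Hidden updates: each column accumulates one gated element of this row.
        hidden_acc += np.where(v[i] == 1, row, 0.0)
    return visible_acc, hidden_acc
```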
At each unit of time, a new entry (row, column, or diagonal) of the weight matrix can be streamed out from the accelerator memory and used to update the neuron states. Depending on the specific hardware architecture used (ASIC vs. FPGA), different architectures can stream either the rows, columns, or diagonals.
Performance of the described parallel architecture can be further improved by including parallel tempering. Parallel tempering is a method that has been shown to accelerate convergence of sampling from an RBM distribution. The hardware accelerator for stochastic sampling (and combinatorial optimization) can be extended to support parallel tempering by adding an inverse temperature parameter (β) and by allowing for swaps between many copies of the system.
The chip (e.g., device of the parallel processor system 600) has a certain amount of memory 115 available (e.g., RAM). The memory 115 is used to store a weight matrix, as well as other information used by the system 600. The weight matrix may be symmetric or asymmetric. The memory management system 610 supports reading from and writing to the memory 115 and provides the weights (and other information such as the inverse temperature parameter β) to each of the plurality of stochastic sampling engines (first stochastic sampling engine 620-A, second stochastic sampling engine 620-B in the illustrative example). Memory management system 610 can stream rows, columns, diagonals, and/or sub-matrices of the weight matrix to the stochastic sampling engines.
In operation, the memory management system 610 of the parallel processor interfaces with a host system 140 to receive the problem to be solved. The memory management system 610 streams rows (or columns or diagonals) of the weight matrix (such as the weight matrix that may be generated when transforming a fully connected Ising model problem to the bipartite graph structure in the RBM) to each of the plurality of sampling engines. Each stochastic sampling engine (e.g., first stochastic sampling engine 620-A, second stochastic sampling engine 620-B in the illustrative example) performs parallelized updates of the visible/hidden states using the Gibbs Sampling update, for example, as described in more detail above.
In operation, each of the temperature parameters (β) can be set before the problem begins and depends on the weights and problem of interest. The individual neuron updates, for both the visible updates and the hidden updates, are then modified to be multiplied by the temperature parameter for that RBM copy (see, e.g., the β parameter multiplier in the first circuit 710 and the second circuit 720 of the first stochastic sampling engine 700-A and in the second stochastic sampling engine 700-B).
For the visible updates this becomes ν = σ(β_k(W h + b_ν)), i.e., ν_i = σ(β_k(Σ_j w_ij h_j + b_i)), while the hidden updates similarly include the same beta parameter and are given as h = σ(β_k(W^T ν + b_h)), i.e., h_j = σ(β_k(Σ_i w_ij ν_i + b_j)).
The swap controller can calculate the swapping probability between two copies according to a swap rule such as
p_swap = min(1, (P1(ν2)·P2(ν1))/(P1(ν1)·P2(ν2))),
where P1(x) and P2(x) are the probabilities given by the first model (e.g., first probability estimator 630-A) and second model (e.g., second probability estimator 630-B), respectively, while ν1 and ν2 are the vector outputs (e.g., visible updates) from the first and second model for that sampling step.
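As a software illustration of the β-scaled updates, the sketch below simply scales the pre-activations of the block Gibbs updates by the copy's inverse temperature β_k. It is a minimal sketch under the assumptions already noted, not the hardware design; in the described accelerator the β multiplier sits inside each sampling engine's first and second circuits.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tempered_gibbs_step(v, W, b_v, b_h, beta_k, rng):
    """Block Gibbs updates for one RBM copy, with pre-activations scaled by beta_k."""
    # Hidden update: h_j = sigma(beta_k * (sum_i w_ij * v_i + b_j)).
    p_h = sigmoid(beta_k * (W.T @ v + b_h))
    h = (rng.random(p_h.shape) < p_h).astype(np.int8)
    # Visible update: v_i = sigma(beta_k * (sum_j w_ij * h_j + b_i)).
    p_v = sigmoid(beta_k * (W @ h + b_v))
    v = (rng.random(p_v.shape) < p_v).astype(np.int8)
    return v, h
```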
Accordingly, a method for combinatorial optimization using parallel tempering of the parallel processor system can include receiving, at the combinatorial optimization accelerator, a problem to be solved from a host system; streaming, by the memory management system, the weights to the sampling engine; receiving, by the sampling engine, a set of inputs and the weights from the memory management system to perform parallel updates of the visible states and the hidden states; receiving, by the probability estimator, the updated visible states and the updated hidden states from the sampling engine to identify states that satisfy certain criteria; and outputting, by the probability estimator, the identified states. The method further includes streaming, by the memory management system, the weights to the second sampling engine; receiving, by the second sampling engine, the set of inputs and the weights from the memory management system to perform parallel updates of corresponding visible states and corresponding hidden states; receiving, by the second probability estimator, the updated corresponding visible states and the updated corresponding hidden states from the second sampling engine to identify states that satisfy certain criteria; receiving, at the swap controller, estimated probabilities from the probability estimator and the second probability estimator; and performing, by the swap controller, swaps of samples for visible states and hidden states between the sampling engine and the second sampling engine based on the estimated probabilities from the probability estimator and the second probability estimator.
The sampling engine and the second sampling engine can each further receive corresponding temperature parameters; and multiply the corresponding temperature parameters with visible updates and hidden updates at that sampling engine.
In some cases, performing, by the swap controller, swaps of samples for visible states and hidden states between the sampling engine and the second sampling engine based on the estimated probabilities from the probability estimator and the second probability estimator comprises: calculating a swapping probability based on a swap rule using the estimated probabilities from the probability estimator and the second probability estimator; and deciding whether to swap the samples based on the swapping probability.
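A minimal software sketch of that swap decision is shown below. It is illustrative only; the function name and the use of callables for the estimated probabilities are assumptions consistent with the swap rule given above, not the swap controller's actual interface.

```python
import numpy as np

def swap_controller_step(P1, P2, state1, state2, rng):
    """Decide whether to swap the samples of two RBM copies.

    P1 and P2 are callables returning the (possibly unnormalized) probability
    estimated for each copy; state1 and state2 are the copies' current samples.
    """
    # Swapping probability based on the swap rule, clipped to 1.
    p_swap = min(1.0, (P1(state2) * P2(state1)) / (P1(state1) * P2(state2)))
    # Decide whether to swap the samples based on the swapping probability.
    if rng.random() < p_swap:
        state1, state2 = state2, state1  # exchange samples between the two units
    return state1, state2

# Example usage with hypothetical probability estimators p1 and p2:
# v1_new, v2_new = swap_controller_step(p1, p2, v1, v2, np.random.default_rng(0))
```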
Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as examples of implementing the claims, and other equivalent features and acts are intended to be within the scope of the claims.
This invention was made with government support under Grant No. HR0011-18-3-0004 awarded by the Department of Defense/Defense Advanced Research Projects Agency (DARPA). The government has certain rights in the invention.