A key part of artificial intelligence and machine learning is the computationally intensive task of matrix multiplication. Matrix multiplication, or the matrix product, is a mathematical operation that produces a matrix from two matrices with entries in a field or, more generally, in a ring or even a semiring. The matrix product represents the composition of linear maps: when two linear maps are represented by matrices, their composition is represented by the matrix product. Matrix multiplication is thus a basic tool of linear algebra and, as such, has numerous applications in many areas of mathematics, as well as in applied mathematics, statistics, physics, economics, and engineering. In more detail, if A is an n×m matrix and B is an m×p matrix, their matrix product AB is an n×p matrix in which the m entries across a row of A are multiplied with the m entries down a column of B and summed to produce an entry of AB.
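By way of illustration only, the following minimal sketch (the function name and example matrices are illustrative, not part of any disclosure) implements the row-by-column rule described above:

```python
# Naive matrix product: entry (i, j) of AB is the sum of the m products of
# the entries across row i of A with the entries down column j of B.
def matmul(A, B):
    n, m = len(A), len(A[0])
    assert m == len(B), "inner dimensions must agree"
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2, 3],
     [4, 5, 6]]            # 2x3
B = [[7, 8],
     [9, 10],
     [11, 12]]             # 3x2
print(matmul(A, B))        # [[58, 64], [139, 154]]
```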
Computing matrix products is a central operation in all computational applications of linear algebra. For n×n matrices, the computational complexity of the basic algorithm is O(n^3), while the asymptotically fastest known algorithm achieves O(n^2.373). This superlinear complexity means that the matrix product is often the critical part of many algorithms. This is reinforced by the fact that many operations on matrices, such as matrix inversion, computing the determinant, and solving systems of linear equations, have the same complexity. Therefore, various algorithms have been devised for computing products of large matrices, taking into account the architecture of computers.
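As one example of the architecture-aware algorithms alluded to above, the following hedged sketch illustrates blocked (tiled) multiplication of square matrices, which improves cache reuse without changing the O(n^3) arithmetic; the block size and function name are illustrative assumptions:

```python
# Blocked (tiled) matrix product: sub-matrices are reused while they are
# cache-resident. The loop order and block size only change the memory
# access pattern, not the O(n^3) operation count.
def blocked_matmul(A, B, block=2):
    n = len(A)                      # assumes square n x n inputs
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += a * B[k][j]
    return C
```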
Matrix multiplication is at the heart of all machine learning algorithms and is the most computationally expensive task in these applications. Machine learning implementations may use general-purpose CPUs and perform matrix multiplications in a serial fashion. Serial computation in the digital domain, together with limited memory bandwidth, sets a limit on the maximum throughput and power efficiency of the computing system.
The following Detailed Description, Figures, and appended Claims signify the nature and advantages of the innovations, embodiments and/or examples of the claimed inventions. All of the Figures signify innovations, embodiments, and/or examples of the claimed inventions for purposes of illustration only and do not limit the scope of the claimed inventions. Such Figures are not necessarily drawn to scale, and are part of the Disclosure.
A switched capacitor vector-dot product (VDP) engine, referred to herein as a VDP engine or VDP, is depicted in
The X and W inputs can be of variable bit-depth, e.g., 8 bits, 4 bits, 3 bits, and the like. The VDP includes a multitude of sub-circuits, as shown in
The W inputs are loaded into the cross-coupled inverters and stored for computation, as shown in the exemplary
In array 300, each column is shown as including N capacitors in parallel. In order to accumulate across the bit-wise depth of W, the columns 350₀-350₇ are successively connected through switch matrix 360. There are two additional columns of capacitors disposed at each end of the array, namely column 350₉ positioned to the right of column 350₀, and column 350₈ positioned to the left of column 350₇. The capacitors in column 350₉ store the accumulated results as the switching network 360 operates across the array.
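For purposes of illustration only, the bit-wise organization described above may be modeled as follows, with bit b of each of the N weights associated with capacitor column b (the names and values are illustrative assumptions, not part of the disclosure):

```python
# Decompose N 8-bit weights into bit planes: plane b collects bit b of every
# weight and corresponds to the storage cells feeding capacitor column b.
W = [200, 17, 96]                   # N = 3 illustrative 8-bit weights
bit_planes = [[(w >> b) & 1 for w in W] for b in range(8)]
for b, plane in enumerate(bit_planes):
    print(f"column {b}: {plane}")   # column 0 holds the LSBs, column 7 the MSBs
```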
Columns 350₀-350₈ are controlled so as to be coupled to or uncoupled from node A by associated switches SW0-SW8, respectively, which in turn are controlled by the timing logic 390. For example, when switch SW0 is caused to close by timing logic 390, the capacitors in column 350₀ are coupled to node A to share and distribute their charges. Capacitors in column 350₉ are directly coupled to node A.
The signals S0-S8, mac_clear, and accum_clear, supplied by timing logic 390, are used to control switches SW0-SW8 of the capacitor array 300. To perform the accumulation column-wise and across the array, the switches are closed and opened in order, thus sharing their charge with the capacitors disposed in column 350₉. For example, when SW0 is closed, the charge in the capacitors of column 350₀ is shared with the capacitors in column 350₉. Because the capacitors in columns 350₀ and 350₉ have the same capacitance, the charge is divided equally between them. After the charge is distributed between the capacitors in columns 350₀ and 350₉, switch SW0 is opened and switch SW1 is closed. Accordingly, the charges of the capacitors in column 350₁ and column 350₉ are redistributed and both halved, so that column 350₉ now holds half the bit-1 charge of column 350₁ plus the bit-0 charge of column 350₀ divided by four; each successive share thus halves every earlier bit's contribution once more, producing the binary weighting. The closing and opening of the switches continues until the final result of the multiplication is achieved in column 350₉ by closing switch SW8. After SW8 is opened, the result of the multiplication at node A of column 350₉ is supplied to comparator 365. The output of comparator 365, which is either a logic 1 or a logic 0, is supplied to SAR 370, whose output provides an additional signal to timing logic 390.
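A behavioral model, provided for illustration only and not representing the circuit itself, may clarify the binary weighting produced by the LSB-to-MSB charge sharing. It assumes each column acts as one lumped capacitor of equal capacitance, so that sharing two columns averages their voltages; the final SW8 share and the comparator/SAR conversion are omitted:

```python
# Ideal charge sharing between two equal capacitances: both nodes settle at
# the average of their voltages.
def share(acc, col):
    avg = (acc + col) / 2.0
    return avg, avg

def accumulate(v):                  # v[0] is the bit-0 (LSB) column voltage
    acc = 0.0                       # column 350_9, cleared by accum_clear
    cols = list(v)
    for b in range(8):              # close SW0, SW1, ..., SW7 in order
        acc, cols[b] = share(acc, cols[b])
    return acc                      # = sum(v[b] / 2**(8 - b) for b in range(8))

v = [1.0] * 8                       # every bit column at 1 V, for illustration
print(accumulate(v))                # 0.99609375, i.e. (2**8 - 1) / 2**8
```

Each later share halves every earlier contribution once more, so bit b ends up weighted by 2^(b-8), which is the desired binary weighting.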
The timing diagram for an 8-bit W being multiplied by a 2-bit X is shown in
When SW8 is closed the second time, the charge stored on the MSB capacitor from the multiplication of X0 with W0-7 is halved to scale it appropriately, while being added to the result of X1 multiplied with W0-7. This final result is stored on the MSB capacitor of column 350₉ and then delivered to the SAR ADC for conversion to the digital domain. With the conversion complete, the entire MAC array is reset through mac_clear while the next N X inputs are shifted in to begin the next MAC operation, starting with bit 0. There is no constraint on the number of bits shifted in for each X input. Shown in
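The bit-serial combination across the X bits may likewise be modeled behaviorally. The sketch below, offered for illustration only, assumes unsigned 8-bit W and 2-bit X values, ideal charge sharing, and a fixed output scaling of 2^(w_bits + x_bits); under these idealized assumptions it recovers the digital dot product exactly:

```python
# Behavioral model of the bit-serial MAC: for each X bit (LSB first), the W
# bit columns are charge-shared LSB-to-MSB, then the running result is shared
# once more (SW8), which halves the older X bit's charge relative to the new.
def vdp(X, W, x_bits=2, w_bits=8):
    total = 0.0                     # MSB capacitor of column 350_9
    for j in range(x_bits):         # shift X in, bit 0 first
        v = [sum(((x >> j) & 1) * ((w >> b) & 1) for x, w in zip(X, W))
             for b in range(w_bits)]        # bit-b partial products
        r = 0.0
        for b in range(w_bits):     # SW0..SW7, LSB to MSB
            r = (r + v[b]) / 2.0
        total = (total + r) / 2.0   # SW8: halve the old result, add the new
    return total * 2 ** (w_bits + x_bits)   # undo the fixed analog scaling

X = [3, 1, 2]                       # 2-bit inputs
W = [200, 17, 96]                   # 8-bit weights
print(vdp(X, W))                    # 809.0
print(sum(x * w for x, w in zip(X, W)))  # 809, for comparison
```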
The timing network is flexible as shown in
As described above, C4-C7 could be ignored, but they can store another W input that is 4 bits deep. The input X could be cycled through again, or an entirely different input X could be shifted in, going through the same operation shown in
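For illustration, the 4-bit split mode may be modeled as two independent accumulations, one over columns C0-C3 and one over columns C4-C7; the helper below is a hedged sketch under the same idealized charge-sharing assumptions, not the disclosed circuit:

```python
# Same behavioral model restricted to a 4-bit W group: only four column
# shares occur per X bit (SW0..SW3 for C0-C3, or the corresponding switches
# for C4-C7), and the output scaling shrinks to 2**(4 + x_bits).
def vdp_4b(X, W4, x_bits=2):
    total = 0.0
    for j in range(x_bits):
        v = [sum(((x >> j) & 1) * ((w >> b) & 1) for x, w in zip(X, W4))
             for b in range(4)]
        r = 0.0
        for b in range(4):
            r = (r + v[b]) / 2.0
        total = (total + r) / 2.0
    return total * 2 ** (4 + x_bits)

Wa = [9, 3, 14]                     # 4-bit weights held in C0-C3
Wb = [5, 12, 7]                     # second 4-bit W held in C4-C7
X = [3, 1, 2]
print(vdp_4b(X, Wa), vdp_4b(X, Wb))            # 58.0 41.0
print(sum(x * w for x, w in zip(X, Wa)),
      sum(x * w for x, w in zip(X, Wb)))       # 58 41, for comparison
```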
The flexibility in how the array can be configured and operated advantageously provides enhanced efficiency in performing MAC operations. The sequence, size, and order of the MAC operations may be arbitrarily configured by the timing logic shown in
The present application claims benefit under 35 USC 119(e) of U.S. Patent Application No. 63/615,226, filed Dec. 27, 2023, the content of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63615226 | Dec. 27, 2023 | US