The invention relates in general to the field of in- and near-memory processing techniques (i.e., methods, apparatuses, and systems) and related acceleration techniques. In particular, it relates to a method of processing data using memory devices having a crossbar array structure, augmented with compute units configured as interleaved switched-capacitor analogue multipliers and adders, where the compute units are operated according to a 3-phase clocking scheme, which is set to perform partial multiplications and additions in the analogue domain according to a certain bit partition.
Matrix-vector multiplications (MVMs) are frequently needed in several applications, such as technical computing applications and, in particular, cognitive tasks. Examples of such cognitive tasks are the training of, and inferences performed with, cognitive models such as neural networks for computer vision and natural language processing, and other machine learning models such as those used for weather forecasting and financial predictions.
MVM operations pose multiple challenges, because of their recurrence, universality, matrix size, and memory requirements. On the one hand, there is a need to accelerate these operations, notably in high-performance computing applications. On the other hand, there is a need to achieve an energy-efficient way of performing them.
Traditional computer architectures are based on the von Neumann computing concept, where processing capability and data storage are split into separate physical units. Such architectures suffer from congestion and high-power consumption, as data must be continuously transferred from the memory units to the control and arithmetic units through physically constrained and costly interfaces.
One possibility to accelerate MVMs is to use dedicated hardware acceleration devices, such as dedicated circuits having a crossbar array configuration. This type of circuit includes input lines and output lines, which are interconnected at cross-points defining cells. The cells contain respective memory devices (or sets of memory devices), which are designed to store respective matrix coefficients. Vectors are encoded as signals applied to the input lines of the crossbar array to perform multiply-accumulate (MAC) operations. There are several possible implementations. For example, the coefficients of the matrix (“weights”) can be stored in columns of cells. Next to every column of cells is a column of arithmetic units that can multiply the weights with input vector values (creating partial products) and finally accumulate all partial products to produce the outcome of a full dot-product. Such an architecture can simply and efficiently map a matrix-vector multiplication. The weights can be updated by reprogramming the memory elements, as needed to perform matrix-vector multiplications. Such an approach breaks the “memory wall” as it fuses the arithmetic and memory units into a single in-memory-computing (IMC) unit, whereby processing is done much more efficiently in or near the memory (i.e., the crossbar array).
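The column-wise mapping just described can be illustrated by a minimal digital sketch (the function name and data are illustrative only, not from any cited document): weights are stored column-wise, each input drives one row, and each column accumulates its partial products into one dot-product.

```python
def crossbar_mvm(weights, x):
    """Digital model of the crossbar mapping: weights stored column-wise;
    each input value drives one row; each column accumulates the partial
    products of its cells into a full dot-product."""
    K, L = len(weights), len(weights[0])
    return [sum(weights[i][j] * x[i] for i in range(K)) for j in range(L)]

W = [[1, 2], [3, 4], [5, 6]]   # K=3 rows, L=2 columns
assert crossbar_mvm(W, [1, 1, 1]) == [9, 12]
```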
The following paper forms part of the background art:
The document PA1 discloses techniques of operating a memory device having a crossbar array structure, where the crossbar array structure includes cells interconnecting rows and columns. The cells include memory elements (namely static random-access memory elements, or SRAM elements) storing respective N-bit weights. The memory elements are connected to respective in-memory compute units, or IMCUs. The IMCUs are collocated with the memory elements in the array, as depicted in
Each IMCU first converts an N-bit weight into a proportional voltage using a pipeline of digital-to-analogue converter (DAC) built from N+1 equally sized stages. A switched-capacitor stage then multiplies these voltages with the M-bit digital input activation. Finally, the output voltages that correspond to the different multiplication results are accumulated along each column by means of charge sharing.
In more detail, the interleaved switched-capacitor circuit shown in
The 3-phase clocking scheme used to operate the IMCUs is illustrated in
Thanks to the collocated architecture, the IMCU circuits, and the 3-phase clocking scheme proposed in PA1, the required circuit area, computation time, and power consumption scale linearly with the bit resolution of both the inputs and the weights.
The document PA2 discloses similar clocking schemes, interleaved switched-capacitor circuits, and crossbar architectures. So, IMCUs configured as interleaved switched-capacitor analogue multipliers and adders are known per se, as well as the 3-phase clocking schemes to operate them.
The document PA3 presents a scalable neural-network inference accelerator based on an array of programmable cores employing mixed-signal in-memory computing, digital near-memory computing, and localized buffering/control. The compute units are operated based on a multi bit-slicing approach, resulting in N×M partial multiplications at each cell. Bit slicing is applied to the input vector elements, which are mapped onto voltage vector inputs to the crossbar array, one at a time. To perform an in-place matrix-vector multiplication, a vector slice is multiplied with a matrix slice, with O(1) time complexity, and the partial products of these operations are combined outside of the crossbar array device through a shift-and-add reduction network.
More generally, various IMC approaches have been proposed. In general, the MVMs can be performed in the digital or analogue domain. Implementations in the analogue domain can show better performance in terms of area and energy-efficiency when compared to fully digital IMCs. This, however, comes at the cost of a limited computational precision.
Physical implementations of analogue IMC circuitry in CMOS (e.g., using SRAM or equivalent memory technology) often rely on switched capacitors circuits where the multi-bit MVMs are executed in a single time step, as in PA1 or PA2. Alternatively, such operations can be performed as a combination of binary operations (i.e., “bit-slicing”) in the analogue domain, followed by analogue-to-digital conversion (using analogue-to-digital converters, or ADCs) and then shift-and-add operations on the partial results, as in PA3. Other approaches rely on SRAM cells, which exploit binary inputs and binary weights. Alternatively, one can also use phase-change memory (PCM) technology for multibit operations, albeit with limited precision.
Each analogue computation mode (i.e., the single-bit or multibit analogue computation mode) has its pros and cons. Performing multi-bit operations in a single step in the analogue domain may limit the analogue signal range, incur more noise, and complicate the ADC and DAC design, while a fully bit-sliced mode requires more ADC conversion steps, which, in turn, incur higher latency and consume more energy.
After a thorough investigation of the available IMC approaches and related techniques, the present inventors devised new designs and operation methods for memory devices based on crossbar array structures, which make it possible to reduce analogue compute signal-to-noise ratio requirements, while making full use of pipelining and thus maximizing the system throughput.
According to a first aspect, the present invention is embodied as a method of processing data. The method relies on a memory device having a crossbar array structure. The latter includes K×L cells, which interconnect K rows and L columns. The cells include respective memory systems, which store respective N-bit weights. The memory systems are connected to respective compute units, which are configured as interleaved switched-capacitor analogue multipliers and adders. According to the proposed method, input signals encoding respective M-bit input words are synchronously applied to respective ones of the K rows. The compute units are operated according to a 3-phase clocking scheme, with a view to obtaining MAC results for each of the L columns, where K≥2, L≥2, N≥2, and M≥2.
Remarkably, the 3-phase clocking scheme is here set to perform n×m partial multiplications, in the analogue domain (i.e., as analogue operations), according to a specific bit partition, so as to obtain n×m partial output signals in output of each of the compute units. This partition decomposes each of the N-bit weights into n groups of bits and each of the M-bit input words into m groups of bits. Each of the n groups and the m groups includes at least one bit. However, at least one of the n groups and/or the m groups includes at least two bits, whereby N+M>n+m≥3. Moreover, the MAC results are obtained by summing the partial output signals obtained by the compute units for each of the L columns. The summed output signals are converted into digital signals encoding partial values. The partial values are shifted according to corresponding bit positions, which are set in accordance with the bit partition, and the shifted values are finally added, so as to recompose the desired output vector components.
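To make the bit partition concrete, the following sketch (illustrative names; pure integer arithmetic standing in for the analogue partial products) decomposes a weight into n groups and an input word into m groups, forms the n×m partial products, and recomposes the full product by shifting each partial according to its bit positions.

```python
def split_bits(value, group_sizes):
    """Split an unsigned integer into bit groups, LSB-first.
    Returns a list of (group_value, bit_offset) pairs."""
    groups, pos = [], 0
    for size in group_sizes:
        groups.append(((value >> pos) & ((1 << size) - 1), pos))
        pos += size
    return groups

def partitioned_multiply(w, x, w_groups, x_groups):
    """Compute w*x as n*m partial multiplications followed by the
    shift-and-add recomposition described above."""
    total = 0
    for a, pa in split_bits(w, w_groups):       # n weight groups
        for b, pb in split_bits(x, x_groups):   # m input-word groups
            total += (a * b) << (pa + pb)       # shift by the bit positions
    return total

# N=4-bit weight split into n=2 groups of 2 bits; M=4-bit input kept whole (m=1):
assert partitioned_multiply(11, 13, [2, 2], [4]) == 11 * 13
```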
The present approach relies on a specific bit partition, which can be regarded as resulting in a granular bit slicing. This proposed solution reduces the analogue compute signal-to-noise ratio requirements. At the same time, the proposed approach can maintain the pipeline behaviour of the system, yet without impacting the throughput.
In embodiments, the granularity of the bit partition of the N-bit weights and the M-bit input words is asymmetric. That is, an average number of bits of the n groups differs from an average number of bits of the m groups. Even if the groups do not need to have a same number of bits, simpler implementations are nevertheless achieved by imposing each of the n groups to have a same number v of bits and, similarly, each of the m groups to have a same number μ of bits, though v may differ from μ. For example, the bit partition may be designed so as to decompose each of the N-bit weights into n groups of v bits, such that N=n×v, where v≥2, and each of the M-bit input words into a single group of M bits, whereby m=1. A preferred variant is to decompose each of the M-bit input words into m groups of μ bits, such that M=m×μ, where μ≥2, and each of the N-bit weights into a single group of N bits, whereby n=1. This allows an easier operation of the compute units.
In embodiments, the compute units are collocated with the respective memory systems to which they are connected and form part of the respective cells. Thus, the n×m partial multiplications are efficiently performed as in-memory operations in the memory device in that case. In variants, the compute units are arranged in a near-memory (analogue) processing unit, as discussed below.
The MAC results are typically obtained via a readout circuitry, which includes analogue-to-digital converters (ADCs) and digital shift-and-adder circuits. The ADCs are connected to respective columns of the compute units for converting the partial output signals as summed for each of the L columns into the digital signals. Note, the compute units form columns, whether collocated with the memory systems in the cells, or not. The digital shift-and-adder circuits are connected in output of respective ones of the ADCs for shifting the partial values and adding the shifted values. The readout circuitry is preferably co-integrated with the crossbar array structure in the memory device.
In preferred embodiments, the compute units are operated thanks to first control signals, which include 3-phase signals for implementing the 3-phase clocking scheme. The MAC results are obtained by applying second control signals, which are in phase with the 3-phase signals, so as to enable a synchronous operation of the compute units and the readout circuitry. The second control signals include first activation signals and second activation signals. The first activation signals activate the ADCs for converting the partial output signals. The second activation signals activate the digital shift-and-adder circuits for shifting the partial values and adding the shifted values.
Preferably, the 3-phase clocking scheme spans a sequence of clock cycles, wherein the sequence decomposes into M sets of clock cycles associated with respective M bits of the M-bit input words. The 3-phase signals are repeatedly applied, M times, during the M sets of clock cycles. Each of the M sets includes three clock cycles, during which the 3-phase signals are successively applied, such that only one phase signal of the 3-phase signals is applied during a single one of the three clock cycles.
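The sequence just described can be enumerated by a minimal digital sketch (the function name and tuple layout are illustrative assumptions): M sets of three clock cycles, with exactly one phase signal applied per cycle.

```python
def three_phase_schedule(M):
    """Enumerate the clocking sequence described above: M sets of three
    clock cycles, with exactly one of the three phase signals applied
    per cycle. Returns (cycle, input_bit, phase) tuples."""
    schedule, cycle = [], 0
    for bit in range(M):             # one set of three cycles per input bit
        for phase in (1, 2, 3):      # phi1, phi2, phi3 applied successively
            schedule.append((cycle, bit, phase))
            cycle += 1
    return schedule

sched = three_phase_schedule(4)      # M = 4 input bits
assert len(sched) == 12              # 4 sets of 3 cycles each
```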
In embodiments, each memory system of each of the K×L cells consists of N serially-connected memory elements, each storing a respective bit of the N-bit weight stored in that cell. A last memory element of the memory elements of each memory system is configured to receive a respective signal of the applied signals, the respective signal encoding a sequence of M bits. Preferably, each compute unit comprises a set of charge adding units and a corresponding set of switching logics. Namely, each of the compute units comprises N charge adding units, which are connected to respective ones of the N serially-connected memory elements via respective switching logics.
In preferred embodiments, the method further comprises performing one or more further operations based on the MAC results obtained, thanks to a near-memory digital processing unit connected in output of the readout circuitry, which allows efficient computing for technical computing applications such as machine learning.
Preferably, the method further comprises optimizing bit cardinalities of the n groups of bits and the m groups of bits with respect to computational precision, latency, and/or energy consumption.
According to another aspect, the invention is embodied as a hardware processing apparatus. The apparatus comprises a memory device and an electronic circuit. The memory device has a crossbar array structure including K×L cells interconnecting K rows and L columns. The cells include respective memory systems storing respective N-bit weights. The apparatus includes K×L compute units, which may advantageously form part of the cells. The compute units are connected to respective ones of the memory systems of the K×L cells. Again, the compute units are configured as interleaved switched-capacitor analogue multipliers and adders. Consistently with the present methods, the electronic circuit is configured to synchronously apply input signals encoding respective M-bit input words to respective ones of the K rows, operate the compute units according to a 3-phase clocking scheme, and obtain MAC results for each of the L columns, where K≥2, L≥2, N≥2, and M≥2.
Moreover, the electronic circuit is further configured to set the clocking scheme to perform n×m partial multiplications (in the analogue domain) according to a specific bit partition. As explained above, this partition decomposes each of the N-bit weights into n groups of bits and each of the M-bit input words into m groups of bits, wherein each of the n groups and the m groups includes at least one bit, but at least one of the n groups and/or the m groups includes at least two bits, whereby N+M>n+m≥3. The aim is to obtain n×m partial output signals by each of the compute units. Moreover, the electronic circuit is further configured to obtain the MAC results by summing the partial output signals obtained by the compute units for each of the L columns, converting the summed output signals into digital signals encoding partial values, shifting the partial values according to corresponding bit positions set in accordance with the bit partition, and adding the shifted values to recompose the desired output vector components.
As said, the compute units are preferably collocated with the memory systems to which they are connected and form part of the respective cells, whereby the n×m partial multiplications are performed in-memory, in operation. In variants, the apparatus further comprises a near-memory processing unit, where the latter includes the compute units. The near-memory processing unit may possibly be co-integrated with the crossbar array structure in the memory device.
Preferably, the electronic circuit includes a readout circuitry, which comprises ADCs and digital shift-and-adder circuits. The ADCs are connected in output of respective columns of the compute units, to convert the n×m partial output signals into the digital signals that encode said partial values, in operation. The digital shift-and-adder circuits are connected in output of respective ones of the ADCs to shift the partial values according to corresponding bit positions set in accordance with the bit partition, and add the shifted values, in operation. The readout circuitry is preferably co-integrated with the crossbar array structure in the memory device.
In embodiments, each of the memory systems of the cells includes serially connected memory elements, e.g., static random-access memory elements. The memory elements are designed to store respective bits of a respective one of the N-bit weights, in operation.
Preferably, the electronic circuit further includes an input unit (configured to apply said input signals), as well as control components. The latter are configured to operate the compute units by applying first control signals. The latter include 3-phase signals for implementing the 3-phase clocking scheme. The control components are further configured to operate the readout circuitry to obtain the MAC results. In operation, this is achieved by applying second control signals in phase with the 3-phase signals. The second control signals include first activation signals to activate the ADCs for converting the partial output signals. They further include second activation signals to activate the digital shift-and-adder circuit for shifting the partial values and adding the shifted values.
In preferred embodiments, the apparatus further includes a near-memory digital processing unit, which is preferably cointegrated with the crossbar array structure. The near-memory digital processing unit is connected in output of the readout circuitry and configured to perform operations based on the MAC results obtained at the readout circuitry.
According to another aspect, the invention is embodied as a computing system, which includes one or more hardware processing apparatuses as described above. Preferably, the computing system further comprises a memory unit and a general-purpose processing unit that is connected to the memory unit to read data from, and write data to, the memory unit. Each of the hardware processing apparatuses is configured to read data from, and write data to, the memory unit. The general-purpose processing unit is configured to: map a given computing task to vectors and weights; instruct to store said weights as N-bit weights in the cells of any of the hardware processing apparatuses; and instruct to apply input signals encoding vector components of such vectors as M-bit input words to rows of any of the hardware processing apparatuses, so as to perform such a computing task, in operation.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings. The illustrations are for clarity in facilitating one skilled in the art in understanding the invention in conjunction with the detailed description. In the drawings:
The accompanying drawings show simplified representations of devices or parts thereof, as involved in embodiments. Technical features depicted in
Apparatuses, systems, and methods, embodying the present invention will now be described, by way of non-limiting examples.
The following description is structured as follows. General embodiments and high-level variants are described in section 1. Section 2 addresses particularly preferred embodiments and technical implementation details. The present method and its variants are collectively referred to as the “present methods”. All references Sn refer to method steps of the flowcharts of
In reference to
The crossbar array structure 15, 15a includes K×L cells 155, 155a. In the present document, each cell is defined as a repeating unit that interconnects a row and a column. I.e., the cells interconnect K rows and L columns, where K≥2 and L≥2. In
Each cell 155, 155a includes a respective memory system 157, see
The memory systems 157 are connected to respective compute units (CUs) 1552, 1552a. The CUs may possibly be collocated with the memory systems 157 (i.e., within the crossbar array structure 15, as shown in
As in typical IMC architectures, the matrix elements that are stored in the memory systems remain stationary (at least during a given MVM calculation cycle), whereas processing occurs via the CUs. Specifically, the stationary matrix elements (i.e., the weights) are stored in the array of memory systems, while input vector components are fed from the outside to the K rows, as illustrated in
The present memory devices 10, 10a are operated as follows. Input signals are synchronously applied to respective rows of the crossbar array 15, 15a, which corresponds to step S50 in the flow of
However, by contrast with the 3-phase clocking scheme used in the documents PA1 and PA2, here the 3-phase clocking scheme is set to perform partial multiplications in the analogue domain according to a specific bit partition, which can be regarded as a granular bit slicing, involving multi-bit analogue operations. That is, the CUs are operated to perform n×m partial multiplications, so as to obtain n×m partial output signals in output of each of the CUs. This partition decomposes each of the N-bit weights into n groups of bits. Similarly, it decomposes each of the M-bit input words into m groups of bits. Still, the numbers (n and m) of groups are subject to certain constraints, which depart from the schemes proposed in the documents PA1-PA3. Namely, each of the n groups and the m groups includes at least one bit but at least one of the n groups and/or at least one of the m groups includes at least two bits, hence the granular bit slicing evoked above.
In more detail, the numbers (n and m) of groups are subject to the constraints N+M>n+m≥3. According to the above definitions, the condition n+m≥3 implies that at least one of n and m exceeds 1, whereby one has either 1<n and 1≤m, or 1≤n and 1<m. In addition, since at least one group includes at least two bits, there are at most N+M−1 groups in total, such that N+M>n+m. The n groups do not need to have the same number of bits as the m groups. For example, in preferred embodiments, m is strictly less than M but strictly more than 1 (e.g., m=2), while n=1. Conversely, n may be strictly less than N but larger than 1, while m=1.
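The constraints above can be summarized in a short validity check (an illustrative sketch; the function name is an assumption, not from the cited documents).

```python
def valid_partition(N, M, w_groups, x_groups):
    """Check the bit-partition constraints: every group has at least one
    bit, at least one group has at least two bits, and N + M > n + m >= 3."""
    n, m = len(w_groups), len(x_groups)
    if sum(w_groups) != N or sum(x_groups) != M:
        return False                  # groups must cover all bits exactly
    if any(g < 1 for g in w_groups + x_groups):
        return False                  # each group needs at least one bit
    if all(g == 1 for g in w_groups + x_groups):
        return False                  # fully bit-sliced mode (as in PA3) excluded
    return N + M > n + m >= 3         # excludes the single-step case n = m = 1

# N = 4 weights, M = 4 inputs:
assert valid_partition(4, 4, [2, 2], [4])                      # granular slicing
assert not valid_partition(4, 4, [1, 1, 1, 1], [1, 1, 1, 1])   # PA3-like
assert not valid_partition(4, 4, [4], [4])                     # PA1/PA2-like
```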
Moreover, the number of bits can vary among the n groups and/or among the m groups. That is, the number of bits can vary from one of the n groups to another, and/or from one of the m groups to another. The partition can actually be optimized against specific applications, this corresponding to step S20 in the flow of
In the present context, the MAC results are obtained S70-S74 column-wise, in three steps. First, the partial output signals obtained by the CUs for each of the L columns are summed, which operation results from the CU design. The summed output signals are converted S72 into digital signals. The converted signals encode partial values. The latter are shifted S74 according to their corresponding bit positions. I.e., such positions are set in accordance with the bit partition used. Finally, the shifted values are added S74, which leads to the desired result, i.e., a vector component yj, where j=1, . . . , L, see the example of
Comments are in order. In the present context, cells 155, 155a should be distinguished from mere memory systems 157, inasmuch as the cells are connected to CUs 1552, 1552a. The reference 1552 refers to CUs that are collocated with the memory systems 157 in the array 15, as illustrated in
The bit partition used causes the CUs to perform multibit multiplications as a series of multi-binary multiplication steps. Instead of performing purely binary bit multiplications (as in PA3), at least some of the multiplications involve groups of several bits. That is, a certain granularity is exploited to optimize performance of the MAC operations, by contrast with the solutions proposed in PA1, PA2, and PA3. The present bit partitions decompose the multiplication of an input word and a weight into n×m partial multiplications, based on n groups of bits stemming from the stored weight and m groups of bits representing the input word. In other words, the signals resulting from the partial multiplications are formed as n×m partial output signals, for each cell.
If the CUs are internal (i.e., collocated with the memory systems, as in
The scheme is logically similar when the CUs are external (yet connected to the respective memory systems 157), except that data exchanges occur over slightly larger distances, i.e., between the crossbar array 15a and the unit 19 in
As noted earlier, the underlying device 10, 10a is operated in a synchronous manner, whereby the CUs 1552, 1552a are operated synchronously with the input signals applied. The MAC results are finally obtained by shifting and adding the converted values synchronously with the operation of the CUs. To that aim, use can be made of in-phase control signals.
The above operations may possibly be complemented by further operations executed by a near-memory digital processing unit 17, 17a, connected in output of the readout circuitry 16, 16a.
In particular, the present methods may further comprise performing S80 one or more further operations based on the MAC results obtained at step S74, thanks to such a near-memory digital processing unit 17, 17a, as assumed in the flow of
The underlying device (or apparatus) 10, 10a typically includes an electrical input unit 11 to apply input signals to the input lines forming the rows, as well as other components (e.g., control units, pre-/post-processing units, etc.), which are preferably co-integrated in a single device. Such a device (or apparatus) concerns another aspect of the invention and may notably be used in a computerized system, which concerns a further aspect. These other aspects are addressed later.
To summarize, the present methods describe an analogue MVM implementation for multi-bit weights and inputs, where the analogue multiplication of weights and inputs is performed at a granularity of a defined number of bits at a time. The underlying architecture, which relies on CUs that are configured as interleaved switched-capacitor analogue multipliers and adders, allows an optimized pipeline operation mode. Unlike the multi bit-slicing scheme used in PA3, the present invention can make full use of pipelining and thus maximize the system throughput.
To fix ideas, PA3 can be regarded as involving N×M partial multiplications at the cells (where N=4 and M=4). These operations consist of single-bit operations, which do not involve any group, unlike the present bit partition. Conversely, the operations performed in the documents PA1 and PA2 can be regarded as involving a single multiplication (m=1 and n=1); the notions of groups and partition are absent in that case. On the contrary, the present approach institutes a bit partition, which results in a granular bit slicing. As can be realized, this granular bit slicing reduces the analogue compute signal-to-noise ratio (SNR) requirements. At the same time, the proposed approach can maintain the pipeline behaviour of the system (which requires adjusting the pulse modulation scheme), yet without impacting the throughput.
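The comparison can be quantified by counting the partial multiplications (and hence per-column conversion steps) implied by each scheme; the sketch below (illustrative function name) uses the N=4, M=4 figures mentioned above.

```python
def num_partials(w_groups, x_groups):
    """Number n*m of partial multiplications (and hence of per-column
    conversion steps) implied by a given bit partition."""
    return len(w_groups) * len(x_groups)

# With N = 4 and M = 4, as in the comparison above:
assert num_partials([1, 1, 1, 1], [1, 1, 1, 1]) == 16  # fully bit-sliced (PA3-like)
assert num_partials([4], [4]) == 1                     # single multi-bit step (PA1/PA2-like)
assert num_partials([4], [2, 2]) == 2                  # granular bit slicing
```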
Another aspect of the invention concerns a hardware processing apparatus 10, 10a. Several features of the apparatus have already been described above in reference to the present methods, albeit implicitly. Such features are only briefly described in the following.
To start with, the apparatus includes a memory device 10, 10a such as described above. The apparatus notably includes CUs 1552, 1552a, which may form part of the cells, or not. In all cases, the CUs are connected to respective memory systems 157 of the cells and are configured as interleaved switched-capacitor analogue multipliers and adders. Moreover, the apparatus includes an electronic circuit, which is configured to synchronously apply input signals encoding M-bit input words to respective rows, operate the CUs 1552, 1552a according to a 3-phase clocking scheme, and obtain MAC results for each of the columns, as discussed above. Consistently with the present methods, the electronic circuit is further configured to set the clocking scheme, so as for the CUs to perform partial multiplications in the analogue domain according to a specific bit partition, which results in the granular bit slicing described above. The partial multiplications are performed on continuous analogue signals, using analogue processing, as opposed to digital signal processing. For completeness, the electronic circuit causes the MAC results to be obtained by: (i) summing the partial output signals obtained by the CUs 1552, 1552a for each column; (ii) converting the summed output signals into digital signals encoding partial values; and (iii) shifting the partial values according to corresponding bit positions (which are set in accordance with the bit partition) and adding the shifted values. As discussed earlier, the CUs 1552 may advantageously be collocated with the memory systems 157, as assumed in
The near-memory processing unit 19 is preferably co-integrated with the crossbar array structure 15, 15a. The apparatus 10, 10a may further include additional units, e.g., an input unit 11, a readout circuitry 16, 16a, and a near-memory digital processing unit 17, 17a. In addition, the apparatus 10, 10a will likely include an input/output unit 18, to interface the apparatus with external computers (not shown in
In general, one or more, possibly all, of the above units 11, 17, 17a, 18, 19 may be co-integrated with the crossbar arrays of the devices 10, 10a. So, the apparatus 10, 10a may possibly be embodied as a single, integrated device 10, 10a, should all involved components be co-integrated with the crossbar array 15, 15a. Note, in that respect, the devices 10, 10a shown in
In embodiments, each memory system 157 of the cells 155, 155a includes serially connected memory elements 1551. The memory elements are designed to store respective bits of a respective N-bit weight, in operation. Preferably, the memory elements 1551 are SRAM elements. Besides SRAM elements, however, other memory technologies can be contemplated, such as technologies relying on sense amplifiers (SAs). In particular, the memory elements may be dynamic random-access memory (DRAM) elements. SAs are used to perform local read operations. The SAs typically do not need to have adjustable threshold levels; a single threshold is sufficient to detect zeros or ones. In variants, however, the SAs may have adjustable threshold levels, so as to be able to read several levels. More generally, use can be made of volatile or nonvolatile memory technology. In particular, the memory elements may be binary phase-change memory (PCM) elements, magnetoresistive random-access memory (MRAM) elements, or resistive random-access memory (ReRAM) elements. All such memory elements can potentially be used in conjunction with the CUs 1552, 1552a described above to provide multibit MAC computing capabilities.
A final aspect concerns a computing system 1, such as depicted in
In addition, the computing system 1 may typically include a memory unit 2 and a general-purpose processing unit 2, which is connected to the memory unit to read data from, and write data to, the memory unit. In the example of
Each hardware processing apparatus 10 in the system 1 is configured to read data from, and write data to, the memory unit 2. Client requests are managed by the general-purpose processing unit 2, which is notably designed to map a given computing task to vectors and weights. Note, the system 1 may in fact include a memory system composed of several memory units. Similarly, the system may include several processing units.
The processing unit 2 is notably configured to instruct to store S30 weights as N-bit weights in the cells 155 of any of the hardware processing apparatuses 10, 10a involved in the system 1. For completeness, the processing unit 2 can instruct to apply S50 input signals encoding vector components of vectors as M-bit input words to rows of any of the hardware processing apparatuses, with a view to performing a computing task. The system 1 may for instance be a composable disaggregated infrastructure, which may include hardware devices 10, 10a as described above along with other hardware acceleration devices, e.g., application-specific integrated circuits (ASICs) and/or field-programmable gate arrays (FPGAs), amongst other possible examples.
Each of the above aspects is now described in detail, in reference to particular embodiments of the invention. The following notably describes preferred bit partitions (subsection 2.1), hardware processing apparatuses and memory devices (subsection 2.2), architectures of interleaved switched-capacitor analogue multipliers and adders (subsection 2.3), phase signals and 3-phase clocking schemes (subsection 2.4), and an example of high-level flow of operation (subsection 2.5).
The granularity of the bit partition of the N-bit weights and the M-bit input words can be asymmetric. That is, the average number of bits of the n groups may differ from the average number of bits of the m groups. In general, the n groups do not need to have a same number of bits, nor do the m groups. The bit distributions can possibly be optimized with respect to the desired application. That is, the present methods may attempt to optimize S20 the bit cardinalities of the n groups of bits and the m groups of bits. Such an optimization may for example be performed with respect to computational precision, latency, and/or energy consumption. In some cases, one may want to favour precision (e.g., when accurate vector-matrix multiplications are needed), while applications that are resilient to reduced precision (e.g., machine learning) may rather call for optimizing latency or energy consumption. Joint optimizations (e.g., against both precision and energy consumption) may further be contemplated, depending on the end user needs.
Even if the groups do not need to have a same number of bits, simpler implementations are achieved by imposing that each of the n groups has a same number v of bits and, similarly, that each of the m groups has a same number u of bits. Still, v will preferably differ from u. For example, each of the n groups (assuming n ≥ 2) may include 2 bits, while the M-bit input words may each be processed as a single group of M bits, i.e., m = 1. In that case, only two parameters must be optimized, i.e., v and u. Generalizing the above example, the bit partition may possibly be designed to decompose each of the N-bit weights into n groups of v bits, such that N = n × v, where v ≥ 2, while each of the M-bit input words is processed as a single group of M bits (m = 1).
In practice, however, grouping the N bits (i.e., imposing n = 1) allows an easier CU design, compared to grouping the M bits. In that case, each of the M-bit input words is decomposed into m groups of u bits, such that M = m × u, where u ≥ 2, while each N-bit weight is processed as a single group of N bits (n = 1). An example of such an implementation is shown in
As a final remark, it should be noted that the present methods may possibly use schemes that purposely drop bits, if necessary, independently of the chosen bit partition.
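To make the bit-partition arithmetic above concrete, a small digital model can verify that the partial products recombine to the full product w × x. This is a sketch only; the helper names (`split_bits`, `partitioned_product`) and the default bit widths (N = 8, v = 2, so n = 4, with each input processed as a single group, m = 1, as in the example above) are illustrative assumptions, not taken from the description.

```python
def split_bits(value, total_bits, group_bits):
    """Split a value's total_bits into little-endian groups of group_bits."""
    groups = []
    for shift in range(0, total_bits, group_bits):
        groups.append((value >> shift) & ((1 << group_bits) - 1))
    return groups

def partitioned_product(w, x, N=8, v=2, M=8, u=8):
    """Reconstruct w * x from partial products of v-bit and u-bit groups,
    i.e., w * x = sum over (i, j) of w_i * x_j * 2**(i*v + j*u)."""
    w_groups = split_bits(w, N, v)   # n = N // v groups of weight bits
    x_groups = split_bits(x, M, u)   # m = M // u groups of input bits
    acc = 0
    for i, wg in enumerate(w_groups):
        for j, xg in enumerate(x_groups):
            acc += (wg * xg) << (i * v + j * u)   # shift-and-add
    return acc

assert partitioned_product(173, 201) == 173 * 201
```

The same helper covers both options discussed above: grouping the weight bits (v < N, m = 1) or, conversely, grouping the input bits (u < M, n = 1).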
As seen in
The example of device 10 shown in
Every CU 1552 in a particular column produces n × m partial output signals that are individually summed in the analogue domain. That is, each of the n × m partial signals is summed with a corresponding one of the n × m partial signals produced by the previous CU in the same column (except, of course, for the very first CU in that column). Accordingly, n × m partial, accumulated signals are obtained in output of each column. Such output signals are then converted to digital signals by a corresponding ADC 161, prior to being shifted and added via the component 162. The conversion, shift, and add operations occur in output of each column. In less preferred variants, intermediate conversions may possibly be performed, e.g., at the level of each cell or each subset of cells. This, however, requires adding ADC converters in output of the (subsets of) cells concerned, as noted earlier.
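The per-column accumulation and the final shift-and-add can be sketched behaviourally as follows. This is a purely digital stand-in for the analogue column sums and for the components 161, 162; the function name and the bit widths (N = 4, v = 2, M = 4, u = 2, hence n × m = 4 partial accumulators) are illustrative assumptions.

```python
def column_mac(weights, inputs, N=4, v=2, M=4, u=2):
    """Model of one crossbar column: accumulate n*m partial signals across
    the rows (the 'analogue' column sums), then shift-and-add the digitized
    partial values, as the component 162 would."""
    n, m = N // v, M // u
    # n*m partial accumulators, one per (weight-group, input-group) pair
    partial = [[0] * m for _ in range(n)]
    for w, x in zip(weights, inputs):          # one CU per row
        for i in range(n):
            wg = (w >> (i * v)) & ((1 << v) - 1)
            for j in range(m):
                xg = (x >> (j * u)) & ((1 << u) - 1)
                partial[i][j] += wg * xg       # summed along the column
    # digital shift-and-add after conversion of the n*m accumulated signals
    return sum(partial[i][j] << (i * v + j * u)
               for i in range(n) for j in range(m))

ws, xs = [5, 9, 14], [3, 7, 11]
assert column_mac(ws, xs) == sum(w * x for w, x in zip(ws, xs))
```

Note that the model sums the partial signals across rows before any shifting, reflecting that only n × m accumulated values per column need be converted and combined, irrespective of the number of rows.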
In variants, the CUs 1552a may form part of a near-memory processing unit 19, which is preferably co-integrated with the crossbar array structure 15a, to form a device 10a. In both cases, the ADCs 161 are connected to respective columns of the CUs, whether the latter are collocated with the memory systems or not. Thus, the operations remain the same, logically speaking, except that signals must be conveyed over slightly larger distances in the example of the device 10a. Operations performed in the near-memory processing unit 19 are still performed as analogue operations, contrary to operations performed by the near-memory digital processing unit 17, 17a.
As further seen in
As illustrated in
Each CU 1552 includes charge adding units (capacitors in the example of
The last memory element 1551 (corresponding to CN in
Each switching logic is configured such that the corresponding capacitor can be pre-charged or charged (e.g., from another capacitor) in response to the application of a clock signal at the switching logic. In addition, each switching logic can connect its respective capacitor to its respective memory element in response to another clock signal applied at the switching logic. Beyond the operation of the compute units shown in
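A minimal behavioural sketch of the capacitor chain may help here, assuming equal unit capacitors and ideal switches. The halving recurrence below approximates the successive charge redistribution between equal capacitors; it is a simplified model, not the exact circuit behaviour, and the names (`VREF`, `word_voltage`) are illustrative.

```python
VREF = 1.0  # assumed reference voltage (illustrative)

def word_voltage(bits_lsb_first):
    """Behavioural model of the capacitor chain: each memory bit either
    pre-charges the next equal-sized capacitor to VREF (bit = 1) or leaves
    it discharged (bit = 0); charge sharing between equal capacitors then
    halves the running voltage, so bits shared in earlier (less significant
    bits first) end up with proportionally smaller weights."""
    v = 0.0
    for b in bits_lsb_first:
        v = (v + b * VREF) / 2.0  # equal-capacitor charge redistribution
    return v

# 0b1011 = 11, listed LSB first: the chain settles to 11/16 of VREF
assert word_voltage([1, 1, 0, 1]) == 11 / 16
```

The final voltage is thus proportional to the stored N-bit word, scaled by 2^(-N), which the digital output block can compensate for by a fixed shift.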
The CUs 1552, 1552a are operated thanks to a 3-phase clocking scheme, which is similar to the schemes presented in PA1 and PA2, subject to differences that are discussed now in detail. That is, the control signal scheme is here adapted to the bit partition used, as well as to the shift-and-add operations.
Several types of control signals can be involved. The CUs 1552, 1552a can notably be operated S60 thanks to first control signals, which include the 3-phase signals (noted Ø0, Ø1, and Ø2 below) used for implementing the 3-phase clocking scheme, which is similar to the scheme discussed in PA1.
In detail, and as seen in
Note, however, that the very first set of the M sets of clock cycles (corresponding to the set i1 in
In addition to the first control signals, second control signals may be used to obtain S70-S74 the MAC results. As reflected in the flow of
In embodiments, the second control signals include signals noted ØMSB,add, ØMSB,rst, Øout,add, ØADC, Ørst, and ØSAA. These decompose into input-bit dependent signals (ØMSB,add, ØMSB,rst, and Øout,add) and group-dependent signals (ØADC, Ørst, and ØSAA). Note, ØADC corresponds to the signal noted ØSMP in PA1.
While the periodicity of the input-bit dependent signals matches that of the first control signals, the periodicity of the group-dependent control signals differs. Specifically, the group-dependent control signals span a sequence of clock cycles, which decomposes into m sets of clock cycles. The example in
The signals ØMSB,rst and ØMSB,add work as in PA1. They are applied to respectively discharge the capacitor CN (see
In the present context, however, the second control signals include the additional, group-dependent signals ØADC, ØSAA, and Ørst. The latter include two types of activation signals, hereafter called first activation signals (noted ØADC) and second activation signals (noted ØSAA). The first activation signals ØADC are applied to activate S72 the ADCs 161, for the ADCs to convert the partial output signals into digital signals. The second activation signals ØSAA are used to activate S74 the digital shift-and-adder circuits 162, for the latter to shift the partial values and add the shifted values. The signal Ørst is used to reset the output capacitors' voltage VC,out to 0 (corresponding to the output capacitors noted Cout,1 to Cout,K in
The operation of the activation signals is as follows. As seen in
Every time the input bits of an input-bit group have been processed, the three signals ØADC, ØSAA, and Ørst are strobed one by one; for m groups of input bits, this happens m times, after which the operation is complete. Note, the roles of input bits and weight bits can be swapped, so as to group weight bits instead of input bits. As noted earlier, the signals ØSAA and Ørst are applied in-phase with the 3-phase signals.
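The group-wise strobing just described can be sketched as a toy event schedule. This model is deliberately coarse: the 3-phase signals Ø0, Ø1, Ø2 are collapsed into a single event string (`"phi0/phi1/phi2"`), the three strobes are placed in consecutive slots rather than modelling their exact phase alignment, and the number of cycles per group is an assumed parameter.

```python
def control_schedule(m, cycles_per_group):
    """Toy timing model: for each of the m input-bit groups, run the
    pipelined 3-phase cycles, then fire the three group-dependent strobes
    (ADC conversion, shift-and-add, output reset) one by one."""
    events = []
    t = 0
    for g in range(m):
        for _ in range(cycles_per_group):
            events.append((t, "phi0/phi1/phi2"))  # pipelined 3-phase cycle
            t += 1
        for strobe in ("phi_ADC", "phi_SAA", "phi_rst"):
            events.append((t, strobe))            # strobed one by one
            t += 1
    return events

sched = control_schedule(m=2, cycles_per_group=3)
# The ADC strobe fires once per group, at each group boundary
assert [e for e in sched if e[1] == "phi_ADC"] == [(3, "phi_ADC"), (9, "phi_ADC")]
```

One can see from the schedule that the group-dependent signals recur with period m times shorter than the overall operation, matching the periodicity contrast noted above.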
Additional signals may be used, which are not shown in
As in PA1, some signals are common for the entire array 15, for instance the 3-phase signals Ø0, Ø1, Ø2, as well as Øout,add. Other signals, such as the signal pair of ØMSB,add and ØMSB,rst, are generated for each row depending on the input vector bits. The signals Ø0, Ø1, Ø2 are active throughout the whole operation. In variants, the signals Ø0, Ø1, Ø2 may occasionally be turned off, e.g., for a few cycles, when the input bits are 0, in order to save energy.
As evoked earlier, each memory system may include N serially-connected memory elements 1551, each storing a respective bit of the corresponding N-bit weight. The last memory element 1551 (corresponding to bit bN and capacitor CN in
Basically, each bit of the stream of M bits received at the last memory element is associated with a respective group of clock cycles, as per the 3-phase clocking scheme discussed above, which results in a sequence of M groups of cycles. By performing a successive and repetitive pipelined application of the 3-phase signals during a given one of the M groups, a phase signal is applied during each cycle of the given group. This allows the CU 1552 to map digital values stored in each memory element into a word proportional voltage, and to transfer the word proportional voltages of the capacitors C1 to CN-1 to the last capacitor CN such that the voltage VCN across the last capacitor CN is the analogue voltage that corresponds to the N-bit word scaled by the bit associated with that group. The output block 16 adequately reconstructs the expected value based on the bit positions corresponding to the groups used in the bit partition. As explained earlier, each CU 1552 preferably comprises N charge adding units, which are connected to respective memory elements 1551 via respective switching logics, see
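The end-to-end behaviour of one CU over a serial stream of M input bits can be modelled as follows. This is a behavioural sketch under the same idealized charge-sharing assumption as before (equal unit capacitors); the function names, the LSB-first bit ordering, and the final rescaling by 2^N are illustrative choices, not details taken from the description.

```python
def serial_mac(weight_bits_lsb, input_bits_lsb):
    """Model of one CU over M serially streamed input bits: for each input
    bit, the capacitor chain settles to a word-proportional voltage gated
    (scaled) by that bit; the output block then weighs each partial result
    by the bit's position and compensates the chain's 2**(-N) scaling."""
    def word_voltage(bits_lsb_first):
        v = 0.0
        for b in bits_lsb_first:
            v = (v + b) / 2.0     # idealized equal-capacitor charge sharing
        return v                   # equals word_value / 2**N

    N = len(weight_bits_lsb)
    out = 0.0
    for j, xb in enumerate(input_bits_lsb):       # LSB-first input stream
        vcn = xb * word_voltage(weight_bits_lsb)  # VCN, scaled by input bit
        out += vcn * (2 ** j)                     # positional (shift) weight
    return out * (2 ** N)                         # undo chain scaling

# weight 0b1011 = 11 and input 0b0110 = 6, both listed LSB first
assert serial_mac([1, 1, 0, 1], [0, 1, 1, 0]) == 11 * 6
```

All intermediate voltages here are dyadic fractions, so the floating-point model is exact; in the actual circuit, the reconstruction is instead performed digitally by the output block 16, from the converted partial signals.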
A preferred flow is shown in
Computerized devices 10, 10a and systems 1 can be suitably designed for implementing embodiments of the present invention as described herein. In that respect, it can be appreciated that the methods described herein are essentially non-interactive, i.e., automated. Automated parts of such methods can be implemented in hardware only, or as a combination of hardware and software. In exemplary embodiments, automated parts of the methods described herein are implemented in software, which is executed by suitable digital processing devices. In particular, the methods described herein may involve executable programs, scripts, or, more generally, any form of executable instructions, e.g., to instruct the devices 10, 10a to perform core computations. The required computer-readable program instructions can for instance be downloaded to processing elements from a computer-readable storage medium, via a network, for example, the Internet and/or a wireless network. However, all embodiments described here involve analogue computations performed thanks to crossbar array structures and compute units as described in sections 2 and 3.
While the present invention has been described with reference to a limited number of embodiments, variants, and the accompanying drawings, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted, without departing from the scope of the present invention. In particular, a feature (device-like or method-like) recited in a given embodiment or variant, or shown in a drawing, may be combined with or replace another feature in another embodiment, variant, or drawing, without departing from the scope of the present invention. Various combinations of the features described in respect of any of the above embodiments or variants may accordingly be contemplated, that remain within the scope of the appended claims. In addition, many minor modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims. In addition, many other variants than those explicitly touched upon above can be contemplated. For example, other types of memory elements can be contemplated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/059113 | 4/6/2022 | WO | |