MULTI-BIT ANALOG MULTIPLY-ACCUMULATE OPERATIONS WITH MEMORY CROSSBAR ARRAYS

Information

  • Patent Application
  • Publication Number
    20250224930
  • Date Filed
    April 06, 2022
  • Date Published
    July 10, 2025
Abstract
The invention is notably directed to a method of processing data. The method relies on a memory device having a crossbar array structure. The latter includes K×L cells, which interconnect K rows and L columns. The cells include respective memory systems, which store respective N-bit weights. The memory systems are connected to respective compute units, which are configured as interleaved switched-capacitor analogue multipliers and adders. According to the proposed method, input signals encoding respective M-bit input words are synchronously applied to respective ones of the K rows. The compute units are operated according to a 3-phase clocking scheme, with a view to obtaining MAC results for each of the L columns, where K≥2, L≥2, N≥2, and M≥2. Remarkably, the 3-phase clocking scheme is here set to perform n×m partial multiplications, in the analogue domain, according to a specific bit partition, so as to obtain n×m partial output signals in output of each of the compute units. This partition decomposes each of the N-bit weights into n groups of bits and each of the M-bit input words into m groups of bits. Each of the n groups and the m groups includes at least one bit. However, at least one of the n groups and/or the m groups includes at least two bits, whereby N+M>n+m≥3. Moreover, the MAC results are obtained by summing the partial output signals obtained by the compute units for each of the L columns. The summed output signals are converted into digital signals encoding partial values. The partial values are shifted according to corresponding bit positions, which are set in accordance with the bit partition, and the shifted values are finally added, so as to recompose the desired output vector components. The invention is further directed to related apparatuses and systems.
Description
BACKGROUND

The invention relates in general to the field of in- and near-memory processing techniques (i.e., methods, apparatuses, and systems) and related acceleration techniques. In particular, it relates to a method of processing data using memory devices having a crossbar array structure, augmented with compute units configured as interleaved switched-capacitor analogue multipliers and adders, where the compute units are operated according to a 3-phase clocking scheme, which is set to perform partial multiplications and additions in the analogue domain according to a certain bit partition.


Matrix-vector multiplications (MVMs) are frequently needed in several applications, such as technical computing applications and, in particular, cognitive tasks. Examples of such cognitive tasks are the training of, and inferences performed with, cognitive models such as neural networks for computer vision and natural language processing, and other machine learning models such as those used for weather forecasting and financial predictions.


MVM operations pose multiple challenges, because of their recurrence, universality, matrix size, and memory requirements. On the one hand, there is a need to accelerate these operations, notably in high-performance computing applications. On the other hand, there is a need to achieve an energy-efficient way of performing them.


Traditional computer architectures are based on the von Neumann computing concept, where processing capability and data storage are split into separate physical units. Such architectures suffer from congestion and high power consumption, as data must be continuously transferred from the memory units to the control and arithmetic units through physically constrained and costly interfaces.


One possibility to accelerate MVMs is to use dedicated hardware acceleration devices, such as dedicated circuits having a crossbar array configuration. This type of circuit includes input lines and output lines, which are interconnected at cross-points defining cells. The cells contain respective memory devices (or sets of memory devices), which are designed to store respective matrix coefficients. Vectors are encoded as signals applied to the input lines of the crossbar array to perform multiply-accumulate (MAC) operations. There are several possible implementations. For example, the coefficients of the matrix (“weights”) can be stored in columns of cells. Next to every column of cells is a column of arithmetic units that can multiply the weights with input vector values (creating partial products) and finally accumulate all partial products to produce the outcome of a full dot-product. Such an architecture can simply and efficiently map a matrix-vector multiplication. The weights can be updated by reprogramming the memory elements, as needed to perform matrix-vector multiplications. Such an approach breaks the “memory wall” as it fuses the arithmetic and memory units into a single in-memory-computing (IMC) unit, whereby processing is done much more efficiently in or near the memory (i.e., the crossbar array).


The following paper forms part of the background art:

    • R. Khaddam-Aljameh, P.-A. Francese, L. Benini and E. Eleftheriou, “An SRAM-Based Multibit In-Memory Matrix-Vector Multiplier With a Precision That Scales Linearly in Area, Time, and Power,” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 2, pp. 372-385, February 2021, doi: 10.1109/TVLSI.2020.3037871, hereafter referred to as “PA1”;
    • US Patent Document U.S. Pat. No. 10,777,253 (B1), “Memory array for processing an N-bit word”, R. Khaddam-Aljameh, M. Le Gallo-Bourdeau, A. Sebastian, E. Eleftheriou, and A. Francese, hereafter “PA2”;
    • H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, and N. Verma, “A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, vol. 64, pp. 236-238, 2021, doi: 10.1109/ISSCC42613.2021.9365788, hereafter “PA3”;
    • F.-J. Wang, G. C. Temes and S. Law, “A quasi-passive CMOS pipeline D/A converter,” in IEEE Journal of Solid-State Circuits, vol. 24, no. 6, pp. 1752-1755, December 1989, doi: 10.1109/4.45017; and
    • P. F. Ferguson, X. Haurie and G. C. Temes, “A highly linear low-power 10 bit DAC for GSM,” Proceedings of the IEEE 2000 Custom Integrated Circuits Conference (Cat. No.00CH37044), 2000, pp. 261-264, doi: 10.1109/CICC.2000.852662.


The document PA1 discloses techniques of operating a memory device having a crossbar array structure, where the crossbar array structure includes cells interconnecting rows and columns. The cells include memory elements (namely static random-access memory elements, or SRAM elements) storing respective N-bit weights. The memory elements are connected to respective in-memory compute units, or IMCUs. The IMCUs are collocated with the memory elements in the array, as depicted in FIG. 1 (b) of PA1, which corresponds to FIG. 4 in the drawings accompanying the present document. The IMCUs are configured as interleaved switched-capacitor analogue multipliers and adders, which are designed to efficiently perform the matrix-vector multiplications. The crossbar array structure is operated by applying input signals encoding respective M-bit input words to respective rows. The IMCUs are operated according to a 3-phase clocking scheme, to obtain multiply-accumulate (MAC) results for each column.


Each IMCU first converts an N-bit weight into a proportional voltage using a pipelined digital-to-analogue converter (DAC) built from N+1 equally sized stages. A switched-capacitor stage then multiplies these voltages with the M-bit digital input activation. Finally, the output voltages that correspond to the different multiplication results are accumulated along each column by means of charge sharing.
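As a purely illustrative sketch (not part of PA1 or of the claimed subject matter), the ideal transfer function of such a weight DAC can be modelled as follows. The function name and the reference-voltage parameter are chosen for illustration, and the pipeline's charge-sharing mechanism is abstracted away:

```python
def weight_to_voltage(weight_bits, v_ref=1.0):
    """Ideal weight-DAC transfer function: the output voltage is
    proportional to the unsigned value of the stored N-bit weight.
    weight_bits lists the bits, most significant first (illustrative model)."""
    n = len(weight_bits)
    value = sum(b << (n - 1 - i) for i, b in enumerate(weight_bits))
    return v_ref * value / (2 ** n)

# A 4-bit weight 0b1010 (decimal 10) maps to 10/16 of the reference voltage.
v = weight_to_voltage([1, 0, 1, 0])  # -> 0.625
```

In the actual circuit, this proportional voltage is produced by successive charge sharing across the pipeline stages rather than by explicit arithmetic.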


In more detail, the interleaved switched-capacitor circuit shown in FIG. 2 of PA1 causes each pipelined DAC to generate a voltage, which is proportional to the stored weight bits representing the unsigned weight. The sign of the precharge voltage is selected based on the sign of both the input and weight. An analogue multiplier performs a multibit multiplication as a series of binary multiplication steps, by suitably controlling switches. Based on each input bit, either zero or a weight with a proportional amount of charge is added to an output capacitor. An analogue accumulator performs the summation of all multiplication results of the IMCUs along each column by means of charge sharing.


The 3-phase clocking scheme used to operate the IMCUs is illustrated in FIG. 4 of PA1. The 3-phase clocking scheme causes the IMCUs to perform N×M multiplications. A sequence of M groups of clock cycles is associated with a respective sequence of M bits, corresponding to the input words. A phase signal is applied during each clock cycle of the M groups. The clocking scheme causes an additional input bit to be processed every three clock cycles, until all bits of the input magnitude have been multiplied and accumulated.


Thanks to the collocated architecture, the IMCU circuits, and the 3-phase clocking scheme proposed in PA1, the required circuit area, computation time, and power consumption scale linearly with the bit resolution of both the inputs and the weights.


The document PA2 discloses similar clocking schemes, interleaved switched-capacitor circuits, and crossbar architectures. So, IMCUs configured as interleaved switched-capacitor analogue multipliers and adders are known per se, as well as the 3-phase clocking schemes to operate them.


The document PA3 presents a scalable neural-network inference accelerator based on an array of programmable cores employing mixed-signal in-memory computing, digital near-memory computing, and localized buffering/control. The compute units are operated based on a multi-bit slicing approach, resulting in N×M partial multiplications at each cell. Bit slicing is applied to the input vector elements, which are mapped onto voltage vector inputs to the crossbar array, one at a time. To perform an in-place matrix-vector multiplication, a vector slice is multiplied with a matrix slice, with O(1) time complexity, and the partial products of these operations are combined outside of the crossbar array device through a shift-and-add reduction network.


More generally, various IMC approaches have been proposed. In general, the MVMs can be performed in the digital or analogue domain. Implementations in the analogue domain can show better performance in terms of area and energy-efficiency when compared to fully digital IMCs. This, however, comes at the cost of a limited computational precision.


Physical implementations of analogue IMC circuitry in CMOS (e.g., using SRAM or equivalent memory technology) often rely on switched-capacitor circuits where the multi-bit MVMs are executed in a single time step, as in PA1 or PA2. Alternatively, such operations can be performed as a combination of binary operations (i.e., “bit-slicing”) in the analogue domain, followed by analogue-to-digital conversion (using analogue-to-digital converters, or ADCs) and then shift-and-add operations on the partial results, as in PA3. Other approaches rely on SRAM cells, which exploit binary inputs and binary weights. Alternatively, one can also use phase-change memory (PCM) technology for multibit operations, albeit with limited precision.


Each analogue computation mode, i.e., single-bit or multibit analogue computation mode, has its pros and cons. Performing multi-bit operations in a single step in the analogue domain may limit the analogue signal range, incur more noise, and complicate the ADC and DAC design, while a fully bit-sliced mode requires more ADC conversion steps, which, in turn, incurs higher latency and consumes more energy.


After a thorough investigation of the available IMC approaches and related techniques, the present inventors came up with new designs and operation methods of memory devices based on crossbar array structures, which make it possible to reduce analogue compute signal-to-noise ratio requirements, while making full use of pipelining and thus maximizing the system throughput.


SUMMARY

According to a first aspect, the present invention is embodied as a method of processing data. The method relies on a memory device having a crossbar array structure. The latter includes K×L cells, which interconnect K rows and L columns. The cells include respective memory systems, which store respective N-bit weights. The memory systems are connected to respective compute units, which are configured as interleaved switched-capacitor analogue multipliers and adders. According to the proposed method, input signals encoding respective M-bit input words are synchronously applied to respective ones of the K rows. The compute units are operated according to a 3-phase clocking scheme, with a view to obtaining MAC results for each of the L columns, where K≥2, L≥2, N≥2, and M≥2.


Remarkably, the 3-phase clocking scheme is here set to perform n×m partial multiplications, in the analogue domain (i.e., as analogue operations), according to a specific bit partition, so as to obtain n×m partial output signals in output of each of the compute units. This partition decomposes each of the N-bit weights into n groups of bits and each of the M-bit input words into m groups of bits. Each of the n groups and the m groups includes at least one bit. However, at least one of the n groups and/or the m groups includes at least two bits, whereby N+M>n+m≥3. Moreover, the MAC results are obtained by summing the partial output signals obtained by the compute units for each of the L columns. The summed output signals are converted into digital signals encoding partial values. The partial values are shifted according to corresponding bit positions, which are set in accordance with the bit partition, and the shifted values are finally added, so as to recompose the desired output vector components.
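As a purely illustrative numeric sketch of the above scheme (not forming part of the claimed subject matter; all names are chosen for illustration), the following model performs the n×m partial multiplications for one column, sums each partial product along the column, and recomposes the exact MAC result by shift-and-add:

```python
def split_bits(value, total_bits, group_sizes):
    """Split an unsigned value into bit groups, MSB group first.
    Returns (group_value, lsb_position) pairs defined by the partition."""
    assert sum(group_sizes) == total_bits
    groups, pos = [], total_bits
    for size in group_sizes:
        pos -= size
        groups.append(((value >> pos) & ((1 << size) - 1), pos))
    return groups

def mac_with_partition(weights, inputs, w_sizes, x_sizes, N, M):
    """Illustrative model of one crossbar column: n*m partial
    multiplications per cell, a per-partition column sum (the analogue
    charge-sharing step), then digital shift-and-add recomposition."""
    total = 0
    for i in range(len(w_sizes)):        # n weight-bit groups
        for j in range(len(x_sizes)):    # m input-bit groups
            partial = 0                  # summed along the column (analogue)
            for w, x in zip(weights, inputs):
                wv, _ = split_bits(w, N, w_sizes)[i]
                xv, _ = split_bits(x, M, x_sizes)[j]
                partial += wv * xv
            # shift by the groups' combined LSB position, then add (digital)
            wpos = sum(w_sizes[i + 1:])
            xpos = sum(x_sizes[j + 1:])
            total += partial << (wpos + xpos)
    return total

# Example: K=3 cells, 4-bit weights split into two 2-bit groups (n=2),
# 4-bit inputs kept whole (m=1); the result equals the exact dot product.
y = mac_with_partition([9, 5, 12], [3, 7, 2], [2, 2], [4], N=4, M=4)  # -> 86
```

Here n×m=2 partial sums are digitized per column instead of N×M=16 (fully bit-sliced) or 1 (single-step), illustrating the intermediate granularity the partition provides.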


The present approach relies on a specific bit partition, which can be regarded as resulting in a granular bit slicing. This proposed solution reduces the analogue compute signal-to-noise ratio requirements. At the same time, the proposed approach can maintain the pipeline behaviour of the system, yet without impacting the throughput.


In embodiments, the granularity of the bit partition of the N-bit weights and the M-bit input words is asymmetric. That is, an average number of bits of the n groups differs from an average number of bits of the m groups. Even if the groups do not need to have a same number of bits, simpler implementations are nevertheless achieved by imposing each of the n groups to have a same number v of bits and, similarly, each of the m groups to have a same number μ of bits, though v may differ from μ. For example, the bit partition may be designed so as to decompose each of the N-bit weights into n groups of v bits, such that N=n×v, where v≥2, and each of the M-bit input words into a single group of M bits, whereby m=1. A preferred variant is to decompose each of the M-bit input words into m groups of μ bits, such that M=m×μ, where μ≥2, and each of the N-bit weights into a single group of N bits, whereby n=1. This simplifies the operation of the compute units.
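Such uniform partitions can be sketched as follows; this is a purely illustrative helper (names chosen for illustration), showing that a word split into equal groups can be recomposed exactly:

```python
def uniform_partition(value, total_bits, group_bits):
    """Decompose a word into equal bit groups (MSB group first), as in
    the uniform partitions N = n*v or M = m*mu (illustrative model)."""
    assert total_bits % group_bits == 0
    mask = (1 << group_bits) - 1
    return [(value >> p) & mask
            for p in range(total_bits - group_bits, -1, -group_bits)]

# Preferred variant: an 8-bit input word split into m=2 groups of mu=4 bits
# (weights kept whole, n=1).
groups = uniform_partition(0b10110100, 8, 4)   # -> [0b1011, 0b0100]
restored = (groups[0] << 4) | groups[1]        # -> 0b10110100
```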


In embodiments, the compute units are collocated with the respective memory systems to which they are connected and form part of the respective cells. Thus, the n×m partial multiplications are efficiently performed as in-memory operations in the memory device in that case. In variants, the compute units are arranged in a near-memory (analogue) processing unit, as discussed below.


The MAC results are typically obtained via a readout circuitry, which includes analogue-to-digital converters (ADCs) and digital shift-and-adder circuits. The ADCs are connected to respective columns of the compute units for converting the partial output signals as summed for each of the L columns into the digital signals. Note, the compute units form columns, whether collocated with the memory systems in the cells, or not. The digital shift-and-adder circuits are connected in output of respective ones of the ADCs for shifting the partial values and adding the shifted values. The readout circuitry is preferably co-integrated with the crossbar array structure in the memory device.


In preferred embodiments, the compute units are operated thanks to first control signals, which include 3-phase signals for implementing the 3-phase clocking scheme. The MAC results are obtained by applying second control signals, which are in phase with the 3-phase signals, so as to enable a synchronous operation of the compute units and the readout circuitry. The second control signals include first activation signals and second activation signals. The first activation signals activate the ADCs for converting the partial output signals. The second activation signals activate the digital shift-and-adder circuits for shifting the partial values and adding the shifted values.


Preferably, the 3-phase clocking scheme spans a sequence of clock cycles, wherein the sequence decomposes into M sets of clock cycles associated with respective M bits of the M-bit input words. The 3-phase signals are repeatedly applied, M times, during the M sets of clock cycles. Each of the M sets includes three clock cycles, during which the 3-phase signals are successively applied, such that only one phase signal of the 3-phase signals is applied during a single one of the three clock cycles.
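By way of a purely illustrative sketch (not the claimed circuit), the cycle structure described above can be enumerated as follows; the tuple layout is chosen for illustration:

```python
def three_phase_schedule(M):
    """Enumerate the clocking scheme described above: M sets of three
    clock cycles, one set per input bit, with exactly one of the three
    phase signals asserted per cycle (illustrative model)."""
    schedule = []
    cycle = 0
    for bit in range(M):          # one set of three cycles per input bit
        for phase in (1, 2, 3):   # the 3-phase signals, applied successively
            schedule.append((cycle, bit, phase))
            cycle += 1
    return schedule

# For M = 2, the scheme spans 6 cycles: phases 1-2-3 for bit 0, then bit 1.
sched = three_phase_schedule(2)
```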


In embodiments, each memory system of the K×L cells consists of N serially-connected memory elements, each storing a respective bit of the N-bit weight stored in that cell. A last memory element of each memory system is configured to receive a respective one of the applied input signals, which encodes a sequence of M bits. Preferably, each compute unit comprises a set of charge adding units and a corresponding set of switching logics. Namely, each of the compute units comprises N charge adding units, which are connected to respective ones of the N serially-connected memory elements via respective switching logics.


In preferred embodiments, the method further comprises performing one or more further operations based on the MAC results obtained, thanks to a near-memory digital processing unit connected in output of the readout circuitry, which allows efficient computing for technical computing applications such as machine learning.


Preferably, the method further comprises optimizing bit cardinalities of the n groups of bits and the m groups of bits with respect to computational precision, latency, and/or energy consumption.
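Such an optimization can be sketched by a small design-space sweep. The following is purely illustrative: the cost model (ADC conversions per MAC, and a count of analogue levels a column sum must resolve) is an assumption made for illustration, not the optimization prescribed by the method:

```python
def enumerate_partitions(N, M, K):
    """Illustrative sweep over uniform bit partitions (v bits per weight
    group, mu bits per input group). Reports two assumed proxy costs:
    n*m ADC conversions per MAC (latency/energy) and the number of
    distinct analogue levels of a column sum (signal-to-noise demand)."""
    options = []
    for v in (s for s in range(1, N + 1) if N % s == 0):
        for mu in (s for s in range(1, M + 1) if M % s == 0):
            n, m = N // v, M // mu
            options.append({
                "v": v, "mu": mu, "n": n, "m": m,
                "adc_conversions": n * m,
                "analogue_levels": K * (2 ** v - 1) * (2 ** mu - 1) + 1,
            })
    return options

# Fewer conversions (lower latency/energy) trade against more analogue
# levels (tighter signal-to-noise requirements), and vice versa.
opts = enumerate_partitions(N=4, M=4, K=16)
```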


According to another aspect, the invention is embodied as a hardware processing apparatus. The apparatus comprises a memory device and an electronic circuit. The memory device has a crossbar array structure including K×L cells interconnecting K rows and L columns. The cells include respective memory systems storing respective N-bit weights. The apparatus includes K×L compute units, which may advantageously form part of the cells. The compute units are connected to respective ones of the memory systems of the K×L cells. Again, the compute units are configured as interleaved switched-capacitor analogue multipliers and adders. Consistently with the present methods, the electronic circuit is configured to synchronously apply input signals encoding respective M-bit input words to respective ones of the K rows, operate the compute units according to a 3-phase clocking scheme, and obtain MAC results for each of the L columns, where K≥2, L≥2, N≥2, and M≥2.


Moreover, the electronic circuit is further configured to set the clocking scheme to perform n×m partial multiplications (in the analogue domain) according to a specific bit partition. As explained above, this partition decomposes each of the N-bit weights into n groups of bits and each of the M-bit input words into m groups of bits, wherein each of the n groups and the m groups includes at least one bit, but at least one of the n groups and/or the m groups includes at least two bits, whereby N+M>n+m≥3. The aim is to obtain n×m partial output signals by each of the compute units. Moreover, the electronic circuit is further configured to obtain the MAC results by summing the partial output signals obtained by the compute units for each of the L columns, converting the summed output signals into digital signals encoding partial values, shifting the partial values according to corresponding bit positions set in accordance with the bit partition, and adding the shifted values to recompose the desired output vector components.


As said, the compute units are preferably collocated with the memory systems to which they are connected and form part of the respective cells, whereby the n×m partial multiplications are performed in-memory, in operation. In variants, the apparatus further comprises a near-memory processing unit, where the latter includes the compute units. The near-memory processing unit may possibly be co-integrated with the crossbar array structure in the memory device.


Preferably, the electronic circuit includes a readout circuitry, which comprises ADCs and digital shift-and-adder circuits. The ADCs are connected in output of respective columns of the compute units, to convert the n×m partial output signals into the digital signals that encode said partial values, in operation. The digital shift-and-adder circuits are connected in output of respective ones of the ADCs to shift the partial values according to corresponding bit positions set in accordance with the bit partition, and add the shifted values, in operation. The readout circuitry is preferably co-integrated with the crossbar array structure in the memory device.


In embodiments, each of the memory systems of the cells includes serially connected memory elements, e.g., static random-access memory elements. The memory elements are designed to store respective bits of a respective one of the N-bit weights, in operation.


Preferably, the electronic circuit further includes an input unit (configured to apply said input signals), as well as control components. The latter are configured to operate the compute units by applying first control signals. The latter include 3-phase signals for implementing the 3-phase clocking scheme. The control components are further configured to operate the readout circuitry to obtain the MAC results. In operation, this is achieved by applying second control signals in phase with the 3-phase signals. The second control signals include first activation signals to activate the ADCs for converting the partial output signals. They further include second activation signals to activate the digital shift-and-adder circuits for shifting the partial values and adding the shifted values.


In preferred embodiments, the apparatus further includes a near-memory digital processing unit, which is preferably co-integrated with the crossbar array structure. The near-memory digital processing unit is connected in output of the readout circuitry and configured to perform operations based on the MAC results obtained at the readout circuitry.


According to another aspect, the invention is embodied as a computing system, which includes one or more hardware processing apparatuses as described above. Preferably, the computing system further comprises a memory unit and a general-purpose processing unit that is connected to the memory unit to read data from, and write data to, the memory unit. Each of the hardware processing apparatuses is configured to read data from, and write data to, the memory unit. The general-purpose processing unit is configured to: map a given computing task to vectors and weights; instruct to store said weights as N-bit weights in the cells of any of the hardware processing apparatuses; and instruct to apply input signals encoding vector components of such vectors as M-bit input words to rows of any of the hardware processing apparatuses, so as to perform such a computing task, in operation.





BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings. The illustrations are for clarity in facilitating one skilled in the art in understanding the invention in conjunction with the detailed description. In the drawings:



FIG. 1 schematically represents a computerized system, in which a user interacts with a server, via a personal computer, in order to offload matrix-vector product calculations to dedicated hardware accelerators, as in embodiments of the invention;



FIGS. 2 and 3 schematically represent selected components of hardware processing apparatuses including crossbar array structures and compute units, according to embodiments. In FIG. 2, the compute units are collocated with memory systems, to which they are connected; the compute units form part of respective cells of the crossbar array structure. In FIG. 3, the compute units are arranged in a near-memory processing unit, in output of the crossbar array structure;



FIG. 4 schematically illustrates the architecture of an in-memory computing system according to the prior art, see the background section, where the compute units are collocated with memory elements of respective cells of the crossbar array structure;



FIG. 5 schematically illustrates the architecture of an in-memory computing system according to embodiments. As in FIG. 2, the compute units are collocated with memory systems of respective cells of the crossbar array structure. Not only are the columns of compute units connected to analogue-to-digital converters, as in FIG. 4, but, in addition, the converters are connected to digital shift-and-adder circuits, to exploit a bit partition that decomposes each N-bit weight of the crossbar array structure into n groups of bits and each M-bit input word into m groups of bits;



FIG. 6 is a schematic of compute units corresponding to a same column of the crossbar array structure, as involved in preferred embodiments. Only one compute unit is shown in detail, though. Each compute unit is configured as an interleaved switched-capacitor analogue multiplier and adder;



FIG. 7 is a timing diagram for control signals used in a 3-phase clocking scheme to perform partial multiplications in the analogue domain at each compute unit, as in embodiments. The aim is to enable a bit partition scheme as evoked above;



FIG. 8 illustrates the operation of multi-bit multiplications by compute units as shown in FIG. 6, based on a preferred bit partition, as used in embodiments; and



FIG. 9 is a flowchart illustrating high-level steps of a method of processing data, according to embodiments.





The accompanying drawings show simplified representations of devices or parts thereof, as involved in embodiments. Technical features depicted in FIGS. 2-6 are not to scale. Similar or functionally similar elements in the figures have been allocated the same numeral references, unless otherwise indicated.


Apparatuses, systems, and methods, embodying the present invention will now be described, by way of non-limiting examples.


DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

The following description is structured as follows. General embodiments and high-level variants are described in section 1. Section 2 addresses particularly preferred embodiments and technical implementation details. The present method and its variants are collectively referred to as the “present methods”. All references Sn refer to method steps of the flowchart of FIG. 9, while numeral references pertain to systems, apparatuses, devices, components, and concepts involved in embodiments of the present invention.


1. General Embodiments and High-Level Variants

In reference to FIGS. 2, 3, 5, and 9, a first aspect of the invention is now described in detail. This aspect concerns a method of processing data. The method relies on a memory device 10, 10a, which has a crossbar array structure 15, 15a. Examples of such memory devices are shown in FIGS. 2, 3, and 5.


The crossbar array structure 15, 15a includes K×L cells 155, 155a. In the present document, each cell is defined as a repeating unit that interconnects a row and a column. I.e., the cells interconnect K rows and L columns, where K≥2 and L≥2. In FIGS. 2 and 3, the first column is patterned by upward diagonal stripes, while the first row has downward diagonal stripes. A cell corresponds to the intersection of a row and a column. As known per se, each row includes one or more input lines and each column includes one or more output lines, which are interconnected at cross-points (i.e., junctions). I.e., each row and each column may in fact involve a plurality of input lines and output lines. In bit-serial implementations, each cell can be connected by a single physical line, which suffices to feed input signals carrying the M-bit input words. In parallel data ingestion approaches, however, parallel conductors may be used to connect to each cell. I.e., bits are injected in parallel via parallel conductors to each of the cells.


Each cell 155, 155a includes a respective memory system 157, see FIG. 5. The memory systems 157 store respective N-bit weights (N≥2), corresponding to matrix elements used to perform matrix-vector multiplications (MVMs). Each memory system 157 preferably includes serially connected memory elements 1551, where such elements store respective bits of the weight stored in the corresponding cell, as illustrated in FIG. 5. The memory elements may for instance be static random-access memory (SRAM) devices. As per the above definitions, each cell corresponds to one cross-point and is assumed to include exactly one memory system 157, which itself may include several memory elements, e.g., SRAM devices. A sub-cell is defined as including exactly one such memory element.


The memory systems 157 are connected to respective compute units (CUs) 1552, 1552a. The CUs may possibly be collocated with the memory systems 157 (i.e., within the crossbar array structure 15, as shown in FIGS. 2 and 5) or be arranged in a near-memory processing unit 19, as assumed in FIG. 3. In each case, the CUs are configured as interleaved switched-capacitor analogue multipliers and adders, similar to the circuit designs proposed in the documents PA1 and PA2, subject to differences discussed later.


As in typical IMC architectures, the matrix elements that are stored in the memory systems remain stationary (at least during a given MVM calculation cycle), whereas processing occurs via the CUs. Specifically, the stationary matrix elements (i.e., the weights) are stored in the array of memory systems, while input vector components are fed from the outside to the K rows, as illustrated in FIGS. 4 and 5.


The present memory devices 10, 10a are operated as follows. Input signals are synchronously applied to respective rows of the crossbar array 15, 15a, which corresponds to step S50 in the flow of FIG. 9. Such signals encode respective M-bit input words, where M≥2. Moreover, the CUs are operated (step S60) according to a 3-phase clocking scheme, with a view to obtaining S70-S74 multiply-accumulate (MAC) results for each of the L columns. A 3-phase clocking scheme is a scheme that basically relies on three non-overlapping signals (e.g., signal pulses), which all have the same duration, where the signals are both successively and repeatedly applied, but only one of these signals is applied during a single clock cycle. Such a scheme is discussed in the prior art documents cited in the background section.


However, by contrast with the 3-phase clocking scheme used in the documents PA1 and PA2, here the 3-phase clocking scheme is set to perform partial multiplications in the analogue domain according to a specific bit partition, which can be regarded as a granular bit slicing, involving multi-bit analogue operations. That is, the CUs are operated to perform n×m partial multiplications, so as to obtain n×m partial output signals in output of each of the CUs. This partition decomposes each of the N-bit weights into n groups of bits. Similarly, it decomposes each of the M-bit input words into m groups of bits. Still, the numbers (n and m) of groups are subject to certain constraints, which depart from the schemes proposed in the documents PA1-PA3. Namely, each of the n groups and the m groups includes at least one bit but at least one of the n groups and/or at least one of the m groups includes at least two bits, hence the granular bit slicing evoked above.


In more detail, the numbers (n and m) of groups are subject to the constraints N+M>n+m≥3. According to the above definitions, at least one of the n and m groups includes more than one bit, whereby one has either 1<n and 1≤m, or 1≤n and 1<m. In addition, there are at most N+M-1 groups in total, such that N+M>n+m≥3. The n groups do not need to have the same number of bits as the m groups. For example, in preferred embodiments, m is strictly less than M but strictly more than 1 (e.g., m=2), while n=1. Conversely, n may be strictly less than N but larger than 1, while m=1.
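To illustrate, the constraints on the bit partition can be checked programmatically. The following is a minimal sketch (illustrative only; the function name and the list-based representation of groups are assumptions, not part of the described circuitry):

```python
# Illustrative check of the bit-partition constraints described above: each of
# the n and m groups includes at least one bit, at least one group includes at
# least two bits, and hence N + M > n + m >= 3.
def partition_is_valid(weight_groups, input_groups):
    """weight_groups / input_groups: per-group bit counts (n and m entries)."""
    n, m = len(weight_groups), len(input_groups)
    N, M = sum(weight_groups), sum(input_groups)
    every_group_nonempty = all(g >= 1 for g in weight_groups + input_groups)
    some_group_multibit = any(g >= 2 for g in weight_groups + input_groups)
    return every_group_nonempty and some_group_multibit and N + M > n + m >= 3

# N=4 weight kept as n=1 group of 4 bits; M=4 input split into m=2 groups of 2 bits
assert partition_is_valid([4], [2, 2])
# Purely binary slicing (one bit per group, as in PA3) violates N+M > n+m
assert not partition_is_valid([1, 1, 1, 1], [1, 1, 1, 1])
```

As the last assertion shows, the fully binary decomposition of PA3 is excluded by construction, as is the single-multiplication case n=m=1 of PA1/PA2 (which fails n+m≥3).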


In addition, the number of bits can vary in each of the n groups and/or each of the m groups. That is, the number of bits can vary from one of the n groups to the other, and/or from one of the m groups to the other. The partition can actually be optimized for specific applications, which corresponds to step S20 in the flow of FIG. 9. Thus, various decomposition schemes can be contemplated, as further discussed later in detail.


In the present context, the MAC results are obtained S70-S74 column-wise, in three steps. First, the partial output signals obtained by the CUs for each of the L columns are summed, an operation that results from the CU design. The summed output signals are converted S72 into digital signals. The converted signals encode partial values. The latter are shifted S74 according to their corresponding bit positions. That is, such positions are set in accordance with the bit partition used. Finally, the shifted values are added S74, which leads to the desired result, i.e., a vector component yj, where j=1, . . . , L, see the example of FIG. 5.
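The final shift-and-add step can be sketched as follows (a hedged illustration in which plain integers stand in for the digitized partial values; the helper name and its arguments are assumptions):

```python
# Sketch of steps S72-S74: the ADC yields one partial value per group pair;
# each is shifted by the bit position implied by the partition and the
# shifted values are summed to recompose the column's MAC result.
def shift_and_add(partial_values, bit_positions):
    """partial_values[k]: digitized column sum for the k-th group pair;
    bit_positions[k]: the corresponding shift set by the bit partition."""
    return sum(v << p for v, p in zip(partial_values, bit_positions))

# Toy check: recompose 13 * 6 from two partials, with the 4-bit input
# 6 = 0b01_10 split into m=2 groups of u=2 bits (n=1, full 4-bit weight 13):
# low group 0b10 -> 13*2 at shift 0; high group 0b01 -> 13*1 at shift 2.
assert shift_and_add([13 * 2, 13 * 1], [0, 2]) == 13 * 6
```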


Comments are in order. In the present context, cells 155, 155a should be distinguished from mere memory systems 157, inasmuch as the cells are connected to CUs 1552, 1552a. The reference 1552 refers to CUs that are collocated with the memory systems 157 in the array 15, as illustrated in FIG. 2 or 5. In that case, one speaks of in-memory CUs, or IMCUs 1552. In variants, the CUs 1552a are external to the array, yet arranged in close proximity with the memory systems 157, i.e., in a near-memory (analogue) processing unit 19, as assumed in FIG. 3. In that case, the CUs form an array of near-memory CUs (or NMCUs). So, the present CUs 1552, 1552a may form an in-memory compute system or a near-memory compute system, respectively leading to in-memory computing and near-memory computing operations. Thus, in general, the present methods may process data in-memory or using near-memory processing. Performing such operations in-memory is preferred, in the interest of efficiency and power consumption. However, one may also want to implement the CUs in a near-memory processing unit, e.g., to be able to reuse existing crossbar array devices.


The bit partition used causes the CUs to perform multibit multiplications as a series of multi-binary multiplication steps. Instead of performing purely binary bit multiplications (as in PA3), at least some of the multiplications involve groups of several bits. That is, a certain granularity is exploited to optimize performance of the MAC operations, by contrast with the solutions proposed in PA1, PA2, and PA3. The present bit partitions cause the multiplication of an input word and a weight to be decomposed as n×m partial multiplications, based on n groups of bits stemming from the stored weight and m groups of bits representing the input word. In other words, the signals resulting from the partial multiplications are formed as n×m partial output signals, for each cell.
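The decomposition into n×m partial products can be illustrated as follows (a sketch in which ideal integer arithmetic stands in for the analogue multiplications; the tuple-based group representation is an assumption):

```python
# Decompose w*x into n*m partial products, each tagged with its bit shift.
# Each group is (value, lsb_position). The patent performs these products in
# the analogue domain; integers are used here purely for illustration.
def partial_products(w_groups, x_groups):
    return [(wv * xv, wp + xp)
            for wv, wp in w_groups
            for xv, xp in x_groups]

# Weight 0b1011 = 11 split into n=2 groups of 2 bits; input 6 as m=1 group.
w_groups = [(0b11, 0), (0b10, 2)]          # 11 = 3 + 2*4
x_groups = [(6, 0)]
parts = partial_products(w_groups, x_groups)
assert len(parts) == 2 * 1                 # n*m partial output signals
# Shifting and adding the partials recovers the full product.
assert sum(p << s for p, s in parts) == 11 * 6
```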


If the CUs are internal (i.e., collocated with the memory systems, as in FIG. 2 or 5), the underlying device 10 (or apparatus) forms an in-memory computing device (or apparatus), where each of the n×m analogue signals outputted from the cells is added in the analogue domain to the corresponding partial output analogue signals of the other CUs on the same column. Then, the added signals are processed in output of each column (e.g., in a respective readout circuitry 16), where they are converted to digital values, shifted in accordance with the bit partition scheme, and then summed, in order to reconstruct the expected MAC result of each column.


The scheme is logically similar when the CUs are external (yet connected to the respective memory systems 157), except that data exchanges occur over slightly larger distances, i.e., between the crossbar array 15a and the unit 19 in FIG. 3. In both cases, however, n×m conversions occur before shifting and adding the signals. In less preferred variants, intermediate conversions can be performed at the level of each cell (or subgroups of cells), which, however, involves additional conversions and thus, additional latency.


As noted earlier, the underlying device 10, 10a is operated in a synchronous manner, whereby the CUs 1552, 1552a are operated synchronously with the input signals applied. The MAC results are finally obtained by shifting and adding the converted values synchronously with the operation of the CUs. To that aim, use can be made of in-phase control signals. FIG. 6 shows an example of a detailed circuit-level implementation of the CUs, while FIG. 7 shows a possible modulation scheme, which is adjusted to support the granular bit slicing, while maintaining full pipelining. FIGS. 6 and 7 are described later in detail.


The above operations may possibly be complemented by further operations executed by a near-memory digital processing unit 17, 17a, connected in output of the readout circuitry 16, 16a.


In particular, the present methods may further comprise performing S80 one or more further operations based on the MAC results obtained at step S74, thanks to such a near-memory digital processing unit 17, 17a, as assumed in the flow of FIG. 9. Having such a near-memory digital processing unit 17, 17a comes in handy for a number of applications, starting with machine learning applications. Note, the processing unit 17, 17a should be distinguished from the near-memory processing unit 19 implementing NMCUs 1552a, as in embodiments such as shown in FIG. 3.


The underlying device (or apparatus) 10, 10a typically includes an electrical input unit 11 to apply input signals to the input lines forming the rows, as well as other components (e.g., control units, pre-/post-processing units, etc.), which are preferably co-integrated in a single device. Such a device (or apparatus) concerns another aspect of the invention and may notably be used in a computerized system, which concerns a further aspect. These other aspects are addressed later.


To summarize, the present methods describe an analogue MVM implementation for multi-bit weights and inputs, where the analogue multiplication of weights and inputs is performed at a granularity of a defined number of bits at a time. The underlying architecture, which relies on CUs that are configured as interleaved switched-capacitor analogue multipliers and adders, allows an optimized pipeline operation mode. Unlike the multi-bit-slicing scheme used in PA3, the present invention can make full use of pipelining and thus maximize the system throughput.


To fix ideas, PA3 can be regarded as involving N×M partial multiplications at the cells (where N=4 and M=4). These operations consist of single-bit operations, which do not involve any group, unlike the present bit partition. Conversely, the operations performed in the documents PA1 and PA2 can be regarded as involving a single multiplication (m=1 and n=1); the notions of groups and partition are absent in that case. On the contrary, the present approach institutes a bit partition, which results in a granular bit slicing. As can be realized, this granular bit slicing reduces the analogue compute signal-to-noise ratio (SNR) requirements. At the same time, the proposed approach can maintain the pipeline behaviour of the system (which requires adjusting the pulse modulation scheme), yet without impacting the throughput.


Another aspect of the invention concerns a hardware processing apparatus 10, 10a. Several features of the apparatus have already been described above in reference to the present methods, be it implicitly. Such features are only briefly described in the following.


To start with, the apparatus includes a memory device 10, 10a such as described above. The apparatus notably includes CUs 1552, 1552a, which may form part of the cells, or not. In all cases, the CUs are connected to respective memory systems 157 of the cells and are configured as interleaved switched-capacitor analogue multipliers and adders. Moreover, the apparatus includes an electronic circuit, which is configured to synchronously apply input signals encoding M-bit input words to respective rows, operate the CUs 1552, 1552a according to a 3-phase clocking scheme, and obtain MAC results for each of the columns, as discussed above. Consistently with the present methods, the electronic circuit is further configured to set the clocking scheme for the CUs to perform partial multiplications in the analogue domain according to a specific bit partition, which results in the granular bit slicing described above. The partial multiplications are performed on continuous analogue signals, using analogue processing, as opposed to digital signal processing. For completeness, the electronic circuit obtains the MAC results by: (i) summing the partial output signals obtained by the CUs 1552, 1552a for each column; (ii) converting the summed output signals into digital signals encoding partial values; and (iii) shifting the partial values according to corresponding bit positions (which are set in accordance with the bit partition) and adding the shifted values. As discussed earlier, the CUs 1552 may advantageously be collocated with the memory systems 157, as assumed in FIG. 2 or 5, or form part of a near-memory processing unit 19, as in FIG. 3. In both cases, the CUs can be regarded as forming L columns, whether physically integrated in the cells of the memory device or not. Note, in variants, some of the CUs may possibly be shared across some of the columns.


The near-memory processing unit 19 is preferably co-integrated with the crossbar array structure 15, 15a. The apparatus 10, 10a may further include additional units, e.g., an input unit 11, a readout circuitry 16, 16a, and a near-memory digital processing unit 17, 17a. In addition, the apparatus 10, 10a will likely include an input/output unit 18, to interface the apparatus with external computers (not shown in FIGS. 2, 3, and 5). This unit 18 is typically a logic circuitry, e.g., a processor or even a full computer.


In general, one or more, possibly all, of the above units 11, 17, 17a, 18, 19 may be co-integrated with the crossbar arrays of the devices 10, 10a. So, the apparatus 10, 10a may possibly be embodied as a single, integrated device 10, 10a, should all involved components be co-integrated with the crossbar array 15, 15a. Note, in that respect, the devices 10, 10a shown in FIGS. 2, 3, and 5 are assumed to be integrated devices. For instance, such devices can be implemented as part of application-specific integrated circuit devices.


In embodiments, each memory system 157 of the cells 155, 155a includes serially connected memory elements 1551. The memory elements are designed to store respective bits of a respective N-bit weight, in operation. Preferably, the memory elements 1551 are SRAM elements 1551. Besides SRAM elements, however, other memory technologies can be contemplated, such as technologies relying on sense amplifiers (SAs). In particular, the memory elements may be dynamic random-access memory (DRAM) elements. SAs are used to perform local read operations. The SAs typically do not need to have adjustable threshold levels; a single threshold is sufficient to detect zeros or ones. In variants, however, the SAs may have adjustable threshold levels, so as to be able to read several levels. More generally, use can be made of volatile or nonvolatile memory technology. In particular, the memory elements may be binary phase-change memory (PCM) elements, magnetoresistive random-access memory (MRAM) elements, or resistive random-access memory (ReRAM) elements. All such memory elements can potentially be used in conjunction with the CUs 1552, 1552a described above to provide multibit MAC computing capabilities.


A final aspect concerns a computing system 1, such as depicted in FIG. 1. Such a system 1 includes one or more hardware processing apparatuses 10, 10a (or in fact integral memory devices) such as described above. In the example of FIG. 1, each apparatus is assumed to be a device 10 such as shown in FIG. 2.


In addition, the computing system 1 may typically include a memory unit 2 and a general-purpose processing unit 2, which is connected to the memory unit to read data from, and write data to, the memory unit. In the example of FIG. 1, the memory unit and the general-purpose processing unit are assumed to form part of a same computerized unit 2, e.g., a server computer, which may interact with clients 4, which may be persons (interacting via personal computers 3, as assumed in FIG. 1), processes, or machines.


Each hardware processing apparatus 10 in the system 1 is configured to read data from, and write data to, the memory unit 2. Client requests are managed by the general-purpose processing unit 2, which is notably designed to map a given computing task to vectors and weights. Note, the system 1 may in fact include a memory system composed of several memory units. Similarly, the system may include several processing units.


The processing unit 2 is notably configured to instruct to store S30 weights as N-bit weights in the cells 155 of any of the hardware processing apparatuses 10, 10a involved in the system 1. For completeness, the processing unit 2 can instruct to apply S50 input signals encoding vector components of vectors as M-bit input words to rows of any of the hardware processing apparatuses, with a view to performing a computing task. The system 1 may for instance be a composable disaggregated infrastructure, which may include hardware devices 10, 10a as described above along with other hardware acceleration devices, e.g., application-specific integrated circuits (ASICs) and/or field-programmable gate arrays (FPGAs), amongst other possible examples.


2. Preferred Embodiments

Each of the above aspects is now described in detail, in reference to particular embodiments of the invention. The following notably describes preferred bit partitions (subsection 2.1), hardware processing apparatuses and memory devices (subsection 2.2), architectures of interleaved switched-capacitor analogue multipliers and adders (subsection 2.3), phase signals and 3-phase clocking schemes (subsection 2.4), and an example of high-level flow of operation (subsection 2.5).


2.1 Bit Partitions

The granularity of the bit partition of the N-bit weights and the M-bit input words can be asymmetric. That is, the average number of bits of the n groups may differ from the average number of bits of the m groups. In general, the n groups do not need to have a same number of bits, neither do the m groups. The bit distributions can possibly be optimized with respect to the desired application. That is, the present methods may attempt to optimize S20 bit cardinalities of the n groups of bits and the m groups of bits. Such an optimization may for example be performed with respect to computational precision, latency, and/or energy consumption. In some cases, one may want to favour precision (e.g., when accurate vector-matrix multiplications are needed), while applications resilient to reduced precision (e.g., machine learning) may instead call for optimizing latency or energy consumption. Joint optimizations (e.g., against both precision and energy consumption) may further be contemplated, depending on the end-user needs.


Even if the groups do not need to have a same number of bits, simpler implementations are achieved by imposing that each of the n groups has a same number v of bits and, similarly, that each of the m groups has a same number u of bits. Still, v will preferably differ from u. For example, each of the n groups (assuming n≥2) may include 2 bits, while the M-bit input words may each be processed as a single group of M bits, i.e., m=1. In that case, only two parameters must be optimized, i.e., v and u. Generalizing the above example, the bit partition may possibly be designed to decompose each of the N-bit weights into n groups of v bits, such that N=n×v, where v≥2, while each of the M-bit input words is processed as a single group of M bits (m=1).


In practice, however, grouping the N bits (i.e., imposing n=1) allows an easier CU design, compared to grouping the M bits. In that case, each of the M-bit input words is decomposed into m groups of u bits, such that M=m×u, where u≥2, while each N-bit weight is processed as a single group of N bits (n=1). An example of such an implementation is shown in FIG. 8. As explained earlier, the analogue multiplications and additions are performed using CUs 1552, 1552a, analogue-to-digital converters (ADCs) 161, and shift-and-add circuitry (amounting to accumulation registers) 162. In the example of FIG. 8, the inputs are applied in m groups of u bits, while each weight is processed as a single group of N bits, such that each ADC 161 operates m times, i.e., m conversions are needed at each calculation cycle. The digitized outputs are subsequently shifted according to the relevant bit positions and then accumulated to form the MAC results. The granular bit slicing approach reduces the analogue compute SNR requirements.
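The readout pattern of FIG. 8 can be sketched as follows (assuming ideal integer arithmetic in place of the analogue path; the function and its signature are illustrative assumptions, not the claimed circuitry):

```python
# Hedged sketch of the FIG. 8 readout under the n=1 partition, with M = m*u:
# the ADC fires once per input-bit group, i.e., m conversions per calculation
# cycle; the shift-and-add stage then accumulates the digitized values.
def column_readout(weights, inputs, u, M):
    """weights: per-row N-bit integers; inputs: per-row M-bit integers."""
    m = M // u                                 # number of input-bit groups
    conversions = []                           # one ADC conversion per group
    for g in range(m):
        group = [(x >> (g * u)) & ((1 << u) - 1) for x in inputs]
        conversions.append(sum(w * xv for w, xv in zip(weights, group)))
    # shift each digitized partial by its group's bit position, then add
    mac = sum(v << (g * u) for g, v in enumerate(conversions))
    return mac, len(conversions)

mac, n_conv = column_readout([13, 7], [6, 9], u=2, M=4)
assert mac == 13 * 6 + 7 * 9 and n_conv == 2   # m = M/u = 2 conversions
```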


As a final remark, it should be noted that the present methods may possibly use schemes that purposely drop bits, if necessary, independently of the chosen bit partition.


2.2 Hardware Processing Apparatuses and Memory Devices

As seen in FIGS. 2 and 3, each apparatus (or memory device) 10, 10a includes a crossbar array structure 15, 15a, as well as CUs 1552, 1552a, which are connected to the memory systems 157 of the cells 155, 155a (see also FIGS. 5 and 6). In addition, the apparatus (or memory device) may include an input unit 11, a readout circuitry 16, 16a, a near-memory digital processing unit 17, 17a, and an input/output (I/O) unit 18. As explained earlier, such components may be co-integrated with the array 15, 15a, to form an integrated device 10, 10a.


The example of device 10 shown in FIGS. 2 and 5 assumes that the CUs 1552 are collocated with the memory systems 157 to which they are connected. Each memory system 157 includes N serially-connected memory elements 1551, e.g., SRAM elements, each storing a respective bit of the corresponding N-bit weight. An additional memory element is typically used to store the sign of the weight. In such embodiments, the CUs 1552 form part, physically, of the cells 155. Thus, each CU 1552 is an IMCU, which performs n×m partial multiplications, in-memory, at each calculation cycle. As further seen in FIGS. 2 and 5, the MAC results are obtained S70-S74 via a readout circuitry 16, which is preferably co-integrated with the crossbar array structure 15 in the memory device 10. The readout circuitry 16 includes ADCs 161, each connected to a respective column of the array 15, for converting the partial output signals as summed for each column into digital signals. Digital shift-and-adder circuits 162 complete the device 10. The circuits 162 are connected in output of respective ADCs 161 for shifting the partial values and adding the shifted values, in accordance with the relevant bit positions thereof.


Every CU 1552 in a particular column produces n×m partial output signals that are individually summed in the analogue domain. That is, each of the n×m partial signals is summed with a corresponding one of the n×m partial signals produced by the previous CU in the same column (except, of course, for the very first CU in that column). Accordingly, n×m partial, accumulated signals are obtained in output of each column. Such output signals are then converted to digital signals by a corresponding ADC 161, prior to being shifted and added via the component 162. The conversion, shift, and add operations occur in output of each column. In less preferred variants, intermediate conversions may possibly be performed, e.g., at the level of each cell or each subset of cells. This, however, requires adding ADCs in output of the (subsets of) cells concerned, as noted earlier.
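The column-wise accumulation can be modelled as follows (plain floating-point additions stand in for the analogue charge-domain summation; the list-based signal representation is an assumption):

```python
# Minimal model of the column accumulation described above: each of the K CUs
# in a column emits n*m partial signals, and partial signal k of one CU is
# summed with partial signal k accumulated from the previous CUs, yielding
# n*m accumulated signals in output of the column (which then go to the ADC).
def accumulate_column(per_cu_partials):
    """per_cu_partials: K lists, each holding the n*m partial signal values
    of one CU (floats model the analogue signal levels)."""
    acc = [0.0] * len(per_cu_partials[0])
    for partials in per_cu_partials:       # walk down the K CUs of the column
        acc = [a + p for a, p in zip(acc, partials)]
    return acc

# Two CUs, each producing n*m = 2 partial signals: element-wise accumulation.
col = accumulate_column([[1.0, 2.0], [0.5, 0.25]])
assert col == [1.5, 2.25]
```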


In variants, the CUs 1552a may form part of a near-memory processing unit 19, which is preferably co-integrated with the crossbar array structure 15a, to form a device 10a. In both cases, the ADCs 161 are connected to respective columns of the CUs, i.e., whether collocated with the memory systems or not. Thus, the operations remain the same, logically speaking, except that signals must be conveyed over slightly larger distances in the example of the device 10a. Operations performed in the near-memory processing unit 19 are still performed as analogue operations, contrary to operations performed by the near-memory digital processing unit 17, 17a.


As further seen in FIGS. 2 and 3, the near-memory digital processing unit 17, 17a is directly connected in output of the readout circuitry 16, 16a. The unit 17, 17a can be used to perform digital operations based on the MAC results obtained at the readout circuitry 16, 16a, which allows efficient computing for technical computing applications such as machine learning.


2.3 Interleaved Switched-Capacitor Analogue Multipliers and Adders

As illustrated in FIG. 6, each CU is configured as an interleaved switched-capacitor analogue multiplier and adder 1552. Each CU 1552 is connected to a respective memory system 157, which, in this example, includes serially connected SRAM memory elements 1551, storing respective bits.


Each CU 1552 includes charge adding units (capacitors in the example of FIG. 6), which are connected to the memory elements via switching logics. Each switching logic includes three switches in the example of FIG. 6. A column of CUs is serially connected to an output block 16, which includes an ADC 161 and a shift-and-adder 162. So, each cell 155 comprises several memory elements 1551, several switching logics, and several capacitors. Each sub-cell corresponds to a single memory element, which connects to a respective capacitor via a respective switching logic. Again, a cell is here considered to include a memory system 157 (i.e., including several memory elements). By contrast, in PA2, a cell is defined as corresponding to a single memory element.


The last memory element 1551 (corresponding to CN in FIG. 6) of the memory system 157 is configured to receive the signal encoding the sequence of M bits. That is, it receives a stream of M bits via the source.


Each switching logic is configured such that the corresponding capacitor can be pre-charged or charged (e.g., from another capacitor) in response to the application of a clock signal at the switching logic. In addition, each switching logic can connect its respective capacitor to its respective memory element in response to another clock signal applied at the switching logic. Beyond the operation of the compute units shown in FIG. 6, which in the present case obey a certain bit partition logic, there are several differences between the design shown in FIG. 6 and the schematic proposed in FIG. 2 of PA1 and the schematics disclosed in PA2. First, the design proposed in FIG. 6 relies on readout circuitry 16 that involves both an ADC 161 and a shift-and-adder circuit 162, unlike PA1 and PA2. Moreover, the compute units also differ in that they do not require a switch for the accumulation that is driven by the signal ØACC in PA1, which basically saves one switch at every cross-point.


2.4 Phase Signals and 3-Phase Clocking Scheme

The CUs 1552, 1552a are operated thanks to a 3-phase clocking scheme, which is similar to the schemes presented in PA1 and PA2, subject to differences that are discussed now in detail. That is, the control signal scheme is here adapted to the bit partition used, as well as to the shift-and-add operations.


Several types of control signals can be involved. The CUs 1552, 1552a can notably be operated S60 thanks to first control signals, which include the 3-phase signals (noted Ø0, Ø1, and Ø2 below) used for implementing the 3-phase clocking scheme, which is similar to the scheme discussed in PA1.


In detail, and as seen in FIG. 7, the 3-phase clocking scheme spans a sequence of clock cycles, where the sequence actually decomposes into M sets of clock cycles, corresponding to sets i1, i2, . . . , iM in FIG. 7. Each of the M sets includes at least three clock cycles. The M sets are associated with respective M bits of the M-bit input words. The 3-phase signals (Ø0, Ø1, and Ø2) are repeatedly applied, M times, during the M sets of clock cycles. The 3-phase signals are successively applied during three clock cycles: only one phase signal of the 3-phase signals is applied during one clock cycle (i.e., a single cycle of the three clock cycles). In other words, a triplet of signal pulses is repetitively applied, in accordance with the M sets of clock cycles, but the three signals of each triplet are successively applied during a single set of clock cycles (corresponding to one of the M sets of clock cycles), meaning that only one pulse is applied during a single clock cycle, hence the name of “3-phase clocking scheme”.
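The resulting phase schedule can be sketched as follows (a simplified model that assumes exactly three cycles per set; the steady-state extension of the very first set i1 is omitted, and the generator name is an illustrative assumption):

```python
# Hypothetical generator of the 3-phase schedule: M sets of three clock
# cycles, with the phases phi0, phi1, phi2 applied successively and exactly
# one phase pulse per clock cycle.
def three_phase_schedule(M):
    """Return the phase applied at each of the 3*M clock cycles."""
    return [f"phi{c % 3}" for c in range(3 * M)]

sched = three_phase_schedule(2)          # two input bits -> two sets of cycles
assert sched == ["phi0", "phi1", "phi2", "phi0", "phi1", "phi2"]
# exactly one pulse per cycle, each phase appearing once per set
assert all(sched.count(p) == 2 for p in ("phi0", "phi1", "phi2"))
```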


Note, however, that the very first set of the M sets of clock cycles (corresponding to the set i1 in FIG. 7) may possibly require more than three clock cycles, to allow a steady state to be achieved, as also described in PA1. Yet, the subsequent sets of clock cycles consist of three cycles only. Thus, in such scenarios, the sets of clock cycles include at least three clock cycles; they mostly consist of three clock cycles only, except the very first set i1 of clock cycles.


In addition to the first control signals, second control signals may be used to obtain S70-S74 the MAC results. As reflected in the flow of FIG. 9, the second control signals are applied at step S70. Such signals are applied in phase with the 3-phase signals, so as to enable a synchronous operation of the CUs 1552, 1552a and the readout circuitry 16, 16a. “In phase” means that rising and falling edges of the second control signals occur in sync with either of the 3-phase signals Ø0, Ø1, and Ø2.


In embodiments, the second control signals include signals noted ØMSB,add, ØMSB,rst, Øout,add, ØADC, Ørst, and ØSAA. These decompose into input-bit dependent signals (ØMSB,add, ØMSB,rst, and Øout,add) and group-dependent signals (ØADC, Ørst, and ØSAA). Note, ØADC corresponds to the signal noted ØSMP in PA1.


While the periodicity of the input-bit dependent signals matches that of the first control signals, the periodicity of the group-dependent control signals does differ. Specifically, the group-dependent control signals span a sequence of clock cycles, which decomposes into m sets of clock cycles. The example in FIG. 7 assumes m=M/2 and illustrates the group-dependency with the counter value mx, which indicates the group currently being processed.


The signals ØMSB,rst and ØMSB,add work as in PA1. They are applied to respectively discharge the capacitor CN (see FIG. 6) to 0, when the input bit is 0, and perform charge-sharing with the previous capacitor CN-1 to generate a weight-proportional voltage on CN in accordance with an input bit of 1. The signal Øout,add is subsequently applied to accumulate the result on the last capacitor. The three input-dependent signals ØMSB,add, ØMSB,rst, and Øout,add are only active after the CU is in steady state, see the timing diagram (FIG. 4) of PA1.


In the present context, however, the second control signals include the additional, group-dependent signals ØADC, ØSAA, and Ørst. The latter include two types of activation signals, hereafter called first activation signals (noted ØADC) and second activation signals (noted ØSAA). The first activation signals ØADC are applied to activate S72 the ADCs 161, for the ADCs to convert the partial output signals into digital signals. The second activation signals ØSAA are used to activate S74 the digital shift-and-adder circuits 162, for the latter to shift the partial values and add the shifted values. The signal Ørst is used to reset the output capacitors' voltage VC,out to 0 (corresponding to the output capacitors noted Cout,1 to Cout,K in FIG. 6).


The operation of the activation signals is as follows. As seen in FIGS. 6 and 7, the activation signal ØADC is applied to the ADC 161, for it to convert current partial output signals into digital signals. The signal ØADC activates the ADC 161 taking into account the sampling clock of the ADC in output of each column. Next, ØSAA is applied to activate S74 the circuit 162, whereby the bit position (“Bit position” in FIG. 6) is fed to the element 162 to execute the shift-and-add operation. This position can be set as a bit shift, as noted in FIG. 8. In the example of FIG. 8, the bit-shift position (corresponding to the signal “Bit position” in FIG. 6) fed to the unit 162 ranges from 0 to (m-1)·u, because the bit partition is assumed to decompose each M-bit input word into m groups of u bits, where M=m×u and u≥2, while each N-bit weight is processed as a single group of N bits (n=1) in this example.


Every time the input bits of an input-bit group have been processed, the three signals ØADC, ØSAA, and Ørst are strobed one by one; for m groups of input bits, this happens m times, after which the operation is completed. Note, the positions of input bits and weight bits can be swapped for grouping weight bits instead of input bits. As noted earlier, the signals ØSAA and Ørst are applied in-phase with the 3-phase signals.
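The strobing pattern of the group-dependent signals can be sketched as follows (an illustrative helper with abbreviated signal names, not part of the described circuitry):

```python
# Sketch of the group-dependent strobing: after each input-bit group is
# processed, the ADC, shift-and-add, and reset signals are strobed one by
# one; for m groups of input bits this sequence occurs m times.
def readout_strobes(m):
    """Return the ordered strobe sequence for m input-bit groups."""
    return [s for _ in range(m) for s in ("ADC", "SAA", "rst")]

assert readout_strobes(2) == ["ADC", "SAA", "rst", "ADC", "SAA", "rst"]
```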


Additional signals may be used, which are not shown in FIG. 7, starting with helper signals to generate other signals such as ØMSB,rst, see for example PA1.


As in PA1, some signals are common for the entire array 15, for instance the 3-phase signals Ø0, Ø1, Ø2, as well as Øout,add. Other signals, such as the signal pair of ØMSB,add and ØMSB,rst, are generated for each row depending on the input vector bits. The signals Ø0, Ø1, Ø2 are active throughout the whole operation. In variants, the signals Ø0, Ø1, Ø2 may occasionally be turned off, e.g., for a few cycles, when the input bits are 0, in order to save energy.


As evoked earlier, each memory system may include N serially-connected memory elements 1551, each storing a respective bit of the corresponding N-bit weight. The last memory element 1551 (corresponding to bit bN and capacitor CN in FIG. 6) of each memory system 157 can be configured in the cell to receive a respective signal, which encodes a sequence of M bits. In variants, more than one element may receive the input-dependent signals. In other variants, it is not the last element but the element that encodes the MSB that receives the input-dependent signals. However, the circuit is preferably configured in such a manner that the last memory element of a column receives the input-dependent signals, which allows an easier implementation.


Basically, each bit of the stream of M bits received at the last memory element is associated with a respective group of clock cycles, as per the 3-phase clocking scheme discussed above, which results in a sequence of M groups of cycles. By performing a successive and repetitive pipelined application of the 3-phase signals during a given one of the M groups, a phase signal is applied during each cycle of the given group. This allows the CU 1552 to map the digital values stored in each memory element into a word-proportional voltage, and to transfer the word-proportional voltages of the capacitors C1 to CN-1 to the last capacitor CN, such that the voltage VCN across the last capacitor CN is the analogue voltage that corresponds to the N-bit word scaled by the bit associated with that group. The output block 16 adequately reconstructs the expected value based on the bit positions corresponding to the groups used in the bit partition. As explained earlier, each CU 1552 preferably comprises N charge adding units, which are connected to respective memory elements 1551 via respective switching logics, see FIG. 6. For example, assume that the chosen bit partition decomposes each M-bit input word into m groups of μ bits (i.e., M=m×μ, μ≥2) and that each N-bit weight is processed as a single group of N bits (n=1). In that case, the m groups impact the application of the signals ØADC, Ørst, and ØSAA. Such signals are successively applied during the clock cycles to generate a voltage across the charge adding unit of the last memory element (corresponding to bN and CN), which corresponds to the N-bit word scaled by a bit value of a respective one of the μ bits within each of the m groups.
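For illustration, the general n×m partial-multiplication scheme and its reconstruction can be sketched functionally as follows. This is a hypothetical Python model, not the hardware implementation; the analogue partial products and the ADC conversion are modelled by plain integer arithmetic, and the helper names are illustrative:

```python
# Hypothetical end-to-end model of the n x m bit partition: each N-bit weight is
# split into n groups of v bits, each M-bit input word into m groups of mu bits,
# the n*m partial products are formed (in hardware, in the analogue domain), and
# the output block shifts each digitized partial by its bit position before adding.

def split(value, groups, width):
    """Decompose value into `groups` fields of `width` bits, LSB group first."""
    mask = (1 << width) - 1
    return [(value >> (g * width)) & mask for g in range(groups)]

def mac_bit_partition(weights, inputs, n, v, m, mu):
    """Multiply-accumulate over one column with an n x m bit partition."""
    acc = 0
    for w, x in zip(weights, inputs):
        for i, wg in enumerate(split(w, n, v)):
            for j, xg in enumerate(split(x, m, mu)):
                partial = wg * xg                   # analogue partial product
                acc += partial << (i * v + j * mu)  # shift per bit position
    return acc

# K = 3 rows, N = 4-bit weights (n = 2, v = 2), M = 4-bit inputs (m = 2, mu = 2)
ws, xs = [5, 9, 14], [3, 7, 11]
assert mac_bit_partition(ws, xs, n=2, v=2, m=2, mu=2) == sum(w * x for w, x in zip(ws, xs))
```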


2.5 Preferred Flow

A preferred flow is shown in FIG. 9. First, a memory device with a crossbar array is provided at step S10. Parameters of an optimal bit partition are loaded at step S20, e.g., in accordance with a client request (not shown) aiming at performing a given computation task involving a matrix-vector product. Bit partitions are assumed to have already been optimized against a variety of applications. At step S30, weights (matrix coefficients) are loaded in the memory systems 157. An input vector of K components is selected at step S40. Corresponding input signals are applied at step S50, which encode the vector components (input words). Meanwhile, the CUs are operated S60 according to a 3-phase clocking scheme as described above. Control signals are concurrently triggered at step S70 for readout purposes. These notably cause the ADCs 161 to convert S72 signals obtained for each column and the components 162 to shift and add S74 the digital values obtained, all in accordance with the loaded bit partition parameters. Optionally, a near-memory digital processing unit is used to further process S80 the MAC results. Any intermediate result can be locally stored S90 or returned. The above steps can be repeated for any required matrix-vector calculation.
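Steps S40-S74 can be sketched functionally for one matrix-vector product. This is a minimal Python model under the n=1 partition of FIG. 8 (inputs split into m groups of μ bits); the per-column analogue summation of step S60 is modelled by a plain sum, and all names and values are illustrative:

```python
# Hypothetical functional sketch of steps S40-S74 of FIG. 9 for one
# matrix-vector product, with n = 1 and inputs split into m groups of mu bits.

def crossbar_matvec(W, x, m, mu):
    """W: K x L list of N-bit weights; x: K input words. Returns L MAC results."""
    K, L = len(W), len(W[0])
    results = []
    for col in range(L):                       # one readout chain per column
        acc = 0
        for j in range(m):                     # S60/S70: one pass per bit group
            g = [(xi >> (j * mu)) & ((1 << mu) - 1) for xi in x]
            summed = sum(W[row][col] * g[row] for row in range(K))  # column sum
            acc += summed << (j * mu)          # S72 (ADC), then S74 (shift-add)
        results.append(acc)
    return results

W = [[3, 7], [5, 2], [6, 4]]                   # K = 3 rows, L = 2 columns
x = [9, 12, 5]                                 # M = 4-bit inputs, m = 2, mu = 2
assert crossbar_matvec(W, x, m=2, mu=2) == [
    sum(W[r][c] * x[r] for r in range(3)) for c in range(2)
]
```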


3. Final Remarks

Computerized devices 10, 10a and systems 1 can be suitably designed for implementing embodiments of the present invention as described herein. In that respect, it can be appreciated that the methods described herein are essentially non-interactive, i.e., automated. Automated parts of such methods can be implemented in hardware only, or as a combination of hardware and software. In exemplary embodiments, automated parts of the methods described herein are implemented in software, which is executed by suitable digital processing devices. In particular, the methods described herein may involve executable programs, scripts, or, more generally, any form of executable instructions, be it to instruct the devices 10, 10a to perform core computations. The required computer readable program instructions can for instance be downloaded to processing elements from a computer readable storage medium, via a network, for example, the Internet and/or a wireless network. However, all embodiments described herein involve analogue computations performed thanks to the crossbar array structures and compute units described in section 2.


While the present invention has been described with reference to a limited number of embodiments, variants, and the accompanying drawings, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted, without departing from the scope of the present invention. In particular, a feature (device-like or method-like) recited in a given embodiment, variant, or shown in a drawing may be combined with or replace another feature in another embodiment, variant, or drawing, without departing from the scope of the present invention. Various combinations of the features described in respect of any of the above embodiments or variants may accordingly be contemplated, that remain within the scope of the appended claims. In addition, many minor modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention is not limited to the particular embodiments disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims. In addition, many variants other than those explicitly touched upon above can be contemplated. For example, other types of memory elements can be contemplated.

Claims
  • 1. A method of processing data, the method comprising: providing a memory device having a crossbar array structure including K×L cells interconnecting K rows and L columns, the cells including respective memory systems storing respective N-bit weights, wherein the memory systems are connected to respective compute units, which are configured as interleaved switched-capacitor analogue multipliers and adders; and synchronously applying input signals encoding respective M-bit input words to respective ones of the K rows, operating the compute units according to a 3-phase clocking scheme, and obtaining multiply-accumulate results for each of the L columns, where K≥2, L≥2, N≥2, and M≥2, wherein the 3-phase clocking scheme is set to perform n×m partial multiplications, in an analogue domain, according to a bit partition decomposing each of the N-bit weights into n groups of bits and each of the M-bit input words into m groups of bits, wherein each of the n groups and the m groups includes at least one bit, but at least one of the n groups and/or the m groups includes at least two bits, whereby N+M>n+m≥3, so as to obtain n×m partial output signals, and the multiply-accumulate results are obtained by summing the partial output signals obtained by the compute units for each of the L columns, converting the summed output signals into digital signals encoding partial values, shifting the partial values according to corresponding bit positions set in accordance with the bit partition, and adding the shifted values.
  • 2. The method according to claim 1, wherein: a granularity of the bit partition of the N-bit weights and the M-bit input words is asymmetric, whereby an average number of bits of the n groups differs from an average number of bits of the m groups.
  • 3. The method according to claim 2, wherein: each of the n groups has a same number v of bits and each of the m groups has a same number μ of bits, where v differs from μ.
  • 4. The method according to claim 3, wherein the bit partition is designed so as to either decompose: each of the N-bit weights into n groups of v bits, such that N=n×v, where v≥2, and each of the M-bit input words into a single group of M bits, whereby m=1, or each of the M-bit input words into m groups of μ bits, such that M=m×μ, where μ≥2, and each of the N-bit weights into a single group of N bits, whereby n=1.
  • 5. The method according to claim 4, wherein the bit partition is designed to decompose each of the M-bit input words into m groups of μ bits, such that M=m×μ, where μ≥2, and each of the N-bit weights into a single group of N bits, whereby n=1.
  • 6. The method according to claim 1, wherein: the compute units are collocated with the respective memory systems to which they are connected and form part of the respective cells, whereby the n×m partial multiplications are performed in-memory in the memory device.
  • 7. The method according to claim 1, wherein: the multiply-accumulate results are obtained via a readout circuitry, which includes: analogue-to-digital converters connected to respective columns of the compute units for converting the partial output signals as summed for each of the L columns into the digital signals; anddigital shift-and-adder circuits connected in output of respective ones of the analogue-to-digital converters for shifting the partial values and adding the shifted values.
  • 8. The method according to claim 7, wherein: the compute units are operated thanks to first control signals, which include 3-phase signals for implementing the 3-phase clocking scheme, and the multiply-accumulate results are obtained by applying second control signals, which are in phase with the 3-phase signals, so as to enable a synchronous operation of the compute units and the readout circuitry, the second control signals including: first activation signals to activate the analogue-to-digital converters for converting the partial output signals, and second activation signals to activate the digital shift-and-adder circuits for shifting the partial values and adding the shifted values.
  • 9. The method according to claim 1, wherein: the 3-phase clocking scheme spans a sequence of clock cycles, wherein the sequence decomposes into M sets of clock cycles associated with respective M bits of the M-bit input words, the 3-phase signals are repeatedly applied, M times, during the M sets of clock cycles, and each of the M sets includes three clock cycles, during which the 3-phase signals are successively applied, such that only one phase signal of the 3-phase signals is applied during a single one of the three clock cycles.
  • 10. The method according to claim 9, wherein: each memory system of the memory systems of each cell of the K×L cells consists of N serially-connected memory elements, each storing a respective bit of one of the N bits of the N-bit weights that is stored in said each cell, wherein a last memory element of the memory elements of said each memory system is configured to receive a respective signal of the applied signals, the respective signal encoding a sequence of M bits.
  • 11. The method according to claim 10, wherein: each of the compute units comprises N charge adding units, which are connected to respective ones of the N serially-connected memory elements via respective switching logics.
  • 12. (canceled)
  • 13. The method according to claim 1, wherein: the method further comprises optimizing bit cardinalities of the n groups of bits and the m groups of bits with respect to computational precision, latency, and/or energy consumption.
  • 14. A hardware processing apparatus, comprising: a memory device having a crossbar array structure including K×L cells interconnecting K rows and L columns, the cells including respective memory systems storing respective N-bit weights; K×L compute units connected to respective ones of the memory systems of the K×L cells, wherein the compute units are configured as interleaved switched-capacitor analogue multipliers and adders; and an electronic circuit configured to synchronously apply input signals encoding respective M-bit input words to respective ones of the K rows, operate the compute units according to a 3-phase clocking scheme, and obtain multiply-accumulate results for each of the L columns, where K≥2, L≥2, N≥2, and M≥2, wherein the electronic circuit is further configured to set the clocking scheme to perform n×m partial multiplications, in an analogue domain, according to a bit partition decomposing each of the N-bit weights into n groups of bits and each of the M-bit input words into m groups of bits, wherein each of the n groups and the m groups includes at least one bit, but at least one of the n groups and/or the m groups includes at least two bits, whereby N+M>n+m≥3, so as to obtain n×m partial output signals, and obtain the multiply-accumulate results by summing the partial output signals obtained by the compute units for each of the L columns, converting the summed output signals into digital signals encoding partial values, shifting the partial values according to corresponding bit positions set in accordance with the bit partition, and adding the shifted values.
  • 15. The hardware processing apparatus according to claim 14, wherein: the compute units are collocated with the memory systems to which they are connected and form part of the respective cells, whereby the n×m partial multiplications are performed in-memory, in operation.
  • 16. The hardware processing apparatus according to claim 14, wherein: the apparatus further comprises a near-memory processing unit, which includes the compute units.
  • 17. The hardware processing apparatus according to claim 14, wherein: the electronic circuit includes a readout circuitry, which comprises analogue-to-digital converters connected in output of respective columns of the compute units, to convert the n×m partial output signals into the digital signals that encode said partial values, in operation; and digital shift-and-adder circuits connected in output of respective ones of the analogue-to-digital converters to shift the partial values according to corresponding bit positions set in accordance with the bit partition, and add the shifted values, in operation.
  • 18. The hardware processing apparatus according to claim 17, wherein: each of the memory systems of the cells includes serially connected memory elements, the latter designed to store respective bits of a respective one of the N-bit weights, in operation.
  • 19. The hardware processing apparatus according to claim 18, wherein the electronic circuit further includes: an input unit configured to apply said input signals; and control components configured to operate the compute units by applying first control signals that include 3-phase signals for implementing the 3-phase clocking scheme, and the readout circuitry to obtain the multiply-accumulate results by applying second control signals in phase with the 3-phase signals, wherein, in operation, the second control signals include first activation signals to activate the analogue-to-digital converters for converting the partial output signals, and second activation signals to activate the digital shift-and-adder circuits for shifting the partial values and adding the shifted values.
  • 20. (canceled)
  • 21. The hardware processing apparatus according to claim 17, wherein the apparatus further includes: a near-memory digital processing unit, wherein the near-memory digital processing unit is connected in output of the readout circuitry and configured to perform operations based on the multiply-accumulate results obtained at the readout circuitry.
  • 22. A computing system comprising: one or more hardware processing apparatuses; a memory unit; and a general-purpose processing unit connected to the memory unit to read data from, and write data to, the memory unit, wherein: each of the hardware processing apparatuses is configured to read data from, and write data to, the memory unit, and the general-purpose processing unit is configured to: map a given computing task to vectors and weights, instruct to store said weights as N-bit weights in cells of any of the hardware processing apparatuses, and instruct to apply input signals encoding vector components of such vectors as M-bit input words to rows of any of the hardware processing apparatuses, so as to perform such a computing task, in operation.
  • 23. (canceled)
PCT Information
Filing Document Filing Date Country Kind
PCT/EP2022/059113 4/6/2022 WO