This application claims the priority benefit of French patent application number 18/71578, the content of which is hereby incorporated by reference in its entirety to the maximum extent allowable by law.
The present disclosure relates to the field of memory circuits. It more particularly aims at a memory circuit capable of implementing calculation operations.
A memory circuit capable not only of storing data, but also of executing, in situ, during accesses to the memory content, a number of logic and/or arithmetic operations having data stored in the circuit as operands, has already been provided in European patent application EP3252774 (DD16812/B14843), in French patent application No 17/62470 (DD17955/B16064) filed on Dec. 26, 2017, and in French patent application No 17/62468 (DD17885/B16288) filed on Dec. 19, 2017. Such a circuit comprises, as in a conventional memory circuit, a plurality of elementary cells arranged in rows and columns, and a control circuit capable of implementing data read or write operations from or into rows of the array. Unlike a conventional memory circuit where only one row of the array can be selected at a time during a read operation, the control circuit is capable of simultaneously selecting in read mode a plurality of rows of the array to perform an operation having data contained in the selected rows as operands.
It would be desirable to at least partly improve certain aspects of such a memory circuit.
Thus, an embodiment provides a memory circuit comprising:
According to an embodiment, the control circuit is capable of implementing a shuffle operation comprising a step of reading configuration data stored inside of the memory circuit and of copying the data into the configuration register of the shuffle circuit.
According to an embodiment, the internal control circuit is capable of implementing a shuffle operation comprising:
According to an embodiment, the shuffle operation further comprises:
According to an embodiment, the memory circuit further comprises, in addition to the array of elementary storage cells, an additional memory intended to store data of configuration of the shuffle circuit.
According to an embodiment, the shuffle operation further comprises copying the data read from the additional memory into the configuration register of the shuffle circuit.
According to an embodiment, the additional memory is a non-volatile memory.
According to an embodiment, the shuffle circuit comprises a plurality of elementary shuffle cells, each comprising two data inputs e1 and e2, two data outputs s1 and s2, and one configuration input c.
According to an embodiment, the shuffle circuit comprises (K/2)*(2*log2(K)−1) elementary shuffle cells arranged in a Benes network, where K is an integer designating the dimension of the input register and of the output port of the shuffle circuit.
According to an embodiment, the shuffle circuit comprises 16 elementary shuffle cells arranged in 5 rows, the rows of rank l=1 to l=3 each comprising 4 elementary cells and the rows of rank l=4 to l=5 each comprising 2 elementary cells, the input register and the output port of the shuffle circuit being of dimension 8, and l being an integer ranging from 1 to 5.
According to an embodiment, the memory circuit further comprises an input-output circuit configurable to couple the read bit lines of the array to the input register of the shuffle circuit and/or to couple the write bit lines of the array to the output port of the shuffle circuit.
According to an embodiment, the memory circuit further comprises a calculation circuit capable of implementing logic or arithmetic operations having data stored in the array of elementary storage cells of the memory circuit as operands.
According to an embodiment, the input-output circuit is further configurable to couple the read bit lines of the array to an input register of the calculation circuit and/or to couple the write bit lines of the array to an output port of the calculation circuit.
The foregoing and other features and advantages will be discussed in detail in the following non-limiting description of specific embodiments in connection with the accompanying drawings.
bis is another illustration of an embodiment of the memory circuit of
bis schematically shows another example of a data processing method capable of being implemented by a memory circuit according to an embodiment; and
The same elements have been designated with the same reference numerals in the different drawings. In particular, the structural and/or functional elements common to the different embodiments may be designated with the same reference numerals and may have identical structural, dimensional, and material properties.
For clarity, only those steps and elements which are useful to the understanding of the described embodiments have been shown and are detailed. In particular, the complete forming of a memory circuit capable of implementing calculation operations by simultaneous activation in read mode of a plurality of rows of an array of elementary cells has not been detailed, the described embodiments being compatible with known architectures of such circuits, and particularly those described in the above-mentioned patent applications.
Throughout the present disclosure, the term “connected” is used to designate a direct electrical connection between circuit elements with no intermediate elements other than conductors, whereas the term “coupled” is used to designate an electrical connection between circuit elements that may be direct, or may be via one or more other elements.
The terms “about”, “approximately”, “substantially”, and “in the order of” are used herein to designate a tolerance of plus or minus 10%, preferably of plus or minus 5%, of the value in question.
In the following description, references to high- and low-level signals should be interpreted in relative fashion, as corresponding to two different states of the binary signals processed by the described circuits. As an example, the high level of the signals corresponds to potentials in the order of a high power supply potential VDD of the described circuits (for example, equal to VDD plus or minus 0.5 V), and the low level of the signals corresponds to potentials in the order of a low power supply potential GND of the described circuits (for example, equal to GND plus or minus 0.5 V).
The memory circuit 100 of
Each elementary cell 10 is capable of storing a data bit. The elementary cells 10 of a same column of the array are connected to a same data output conductive track or read bit line RBL of the circuit, and to a same data input conductive track or write bit line WBL of the circuit, it being understood that the elementary cells 10 of different columns are connected to different read bit lines RBL and to different write bit lines WBL. In other words, for each column of rank j of the array, where j is an integer in the range from 0 to N−1, memory circuit 100 comprises a read bit line RBL<j> and a write bit line WBL<j> specific to the column, interconnecting the column cells. The read bit lines RBL<j> and the write bit lines WBL<j> extend in the column direction of the array, that is, vertically in the orientation of
Memory circuit 100 further comprises a row selection circuit RS, for example, arranged at one end of the rows of the array.
Memory circuit 100 further comprises a control circuit CTRL capable of receiving instructions originating from a device external to the memory circuit, and of accordingly controlling the elementary cells 10 of the memory circuit via row selection circuit RS and/or via input-output circuit IO.
In this example, the memory circuit is capable not only of storing data, but also of executing, in situ, during accesses to the memory content, a number of logic and/or arithmetic operations having data stored in the memory circuit as operands. The results of the calculations may be transmitted to circuits external to the memory circuit and/or be written back into the memory circuit without transiting through circuits external to the memory circuit, and particularly over a data bus external to the memory circuit.
In the example of
Elementary cell 10 of
In a memory circuit of the type described in relation with
As an example, the read bit line RBLT and the write bit line WBLT of cell 10 of
To read from an elementary cell 10 of the type described in relation with
To implement calculation operations, in a memory circuit formed of elementary cells 10 of the type described in relation with
As an illustration, it is considered that two elementary cells 10 of a same column are simultaneously activated in read mode via their ports RPT and RPF. To achieve this, after having precharged to a high level the read bit lines RBLT and RBLF of the column and set to a low level the reference conductive tracks VGNDT and VGNDF of the column, the selection transistors T4 and T6 of the two selected cells are simultaneously turned on, via the control conductive tracks RWLT and RWLF of the corresponding rows. The binary values stored on nodes BLTI of the two considered cells are respectively called A and B. Complementary binary values NA and NB are then stored on the respective nodes BLFI of the two cells. The conductive output track RBLT of the column remains at its high precharge level only if the two values A and B are at a low level. Thus, the level read from output track RBLT of the column at the end of the read operation corresponds to result NA.NB of a logic NOR calculation operation between the two selected memory cells. Further, the conductive output track RBLF of the column remains at its high precharge level only if the two values A and B are at a high level. Thus, the level read from output track RBLF of the column at the end of the read operation corresponds to result A.B of a logic AND calculation operation between the two selected memory cells.
By simultaneously activating in read mode more than two cells of a same column, the above-mentioned calculation operations may be implemented with more than two operands. Further, by simultaneously activating in read mode a plurality of memory circuit columns, the calculation operations may be implemented on words comprising a plurality of bits.
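As a purely behavioral illustration of the multi-row read described above, the following Python sketch models the logic levels obtained on the read bit lines RBLT and RBLF of a column when several rows are activated at once. The function names `read_column` and `read_rows` are hypothetical, and the model ignores all electrical aspects (precharge, timing, transistor behavior).

```python
def read_column(stored_bits):
    """Model one column read with several rows activated at once.

    stored_bits holds the binary values stored on nodes BLTI of the
    activated cells. Returns the logic levels (RBLT, RBLF) after the read.
    """
    # RBLT keeps its high precharge level only if every stored value is low:
    # this is the NOR of the operands.
    rblt = int(not any(stored_bits))
    # RBLF keeps its high precharge level only if every stored value is high:
    # this is the AND of the operands.
    rblf = int(all(stored_bits))
    return rblt, rblf


def read_rows(rows):
    """Apply read_column to every column of the simultaneously selected rows."""
    return [read_column(column) for column in zip(*rows)]


if __name__ == "__main__":
    a = [0, 0, 1, 1]  # word stored in a first selected row
    b = [0, 1, 0, 1]  # word stored in a second selected row
    for rblt, rblf in read_rows([a, b]):
        print("NOR =", rblt, "AND =", rblf)
```

Passing more than two rows to `read_rows` models operations with more than two operands, and the length of each row models the number of columns activated in parallel.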
Thus, basic calculation operations may be directly implemented in the memory, without needing to have the data transit through an arithmetic and logic unit external to the circuit.
To enable the result of a calculation operation to be written back without having the calculated data transit through circuits external to the memory circuit, and particularly over a data bus external to the memory circuit, the input-output circuit IO of the memory circuit may comprise an internal redirection circuit (not detailed) capable of coupling the read bit line(s) of each column of the memory circuit to a data input node of the column and/or of another column of the array, for example, to the conductive track WBLT and/or to the conductive track WBLF of the column or of another column of the array in an architecture of the type described in relation with
To implement further calculation operations, for example, arithmetic operations, the operand data may be transmitted to calculation circuit 20. For this purpose, the internal redirection circuit of input-output circuit IO may be configured to couple the read bit lines of the memory circuit to input nodes of the data input register 22 of calculation circuit 20. Once the operation has been performed, the internal redirection circuit of input-output circuit IO may be configured to couple the write bit lines of the memory circuit to output nodes of the data output port 24 of calculation circuit 20, so that the result of the calculation operation can be written back without having the calculated data transit through circuits external to the memory circuit, and particularly over a data bus external to the memory circuit.
The memory circuit 200 of
Memory circuit 200 further comprises a shuffle circuit 30. Shuffle circuit 30 comprises a data input register 32 of dimension K, that is, capable of simultaneously receiving (in parallel) and of storing K data bits, and a data output port 34 of same dimension K, that is, capable of simultaneously delivering (in parallel) K data bits. As an example, dimension K of shuffle circuit 30 is equal to number N of columns of the storage cell array of the memory circuit. Shuffle circuit 30 further comprises a configuration register 36. The shuffle circuit is capable of delivering on its output port the K bits stored in its data input register, shuffled according to a shuffle operation defined according to the state of its configuration register 36.
A shuffle operation here means that the K input bits of circuit 30 are copied at the output of circuit 30, but in a different order. The total number of possible shuffle operations for a K-bit input vector is equal to K! (K factorial), that is K*(K−1)*(K−2)* . . . *2*1. Shuffle circuit 30 is for example configurable, via its configuration register 36, to implement any of the K! possible shuffle operations.
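The notion of shuffle operation can be illustrated by a short Python sketch. It is a minimal model only: the function names `shuffle` and `shuffle_count` are hypothetical, and the convention that output position i receives the input bit at position `order[i]` is an assumption made for the illustration, not a statement about how configuration register 36 encodes the operation.

```python
from math import factorial


def shuffle(bits, order):
    """Return the input bits reordered so that output[i] = bits[order[i]].

    order must contain each input position exactly once, so that every
    input bit is copied to the output, only in a different order.
    """
    if sorted(order) != list(range(len(bits))):
        raise ValueError("order must be a permutation of the input positions")
    return [bits[src] for src in order]


def shuffle_count(k):
    """Total number of possible shuffle operations for a K-bit input vector."""
    return factorial(k)


if __name__ == "__main__":
    print(shuffle(["b0", "b1", "b2", "b3"], [2, 0, 3, 1]))
    print(shuffle_count(8))  # K! for K = 8
```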
The data input register 32, the data output port 34 and the configuration register 36 of shuffle circuit 30 are coupled to the input-output circuit IO of the memory circuit. To implement a shuffle operation by means of circuit 30, control circuit CTRL orders the reading of input data from the storage array of the memory circuit, the writing of the data, via input-output circuit IO, into the data input register 32 of shuffle circuit 30, the reading of configuration data from the storage array of the memory circuit, and the writing of the data, via input-output circuit IO, into the configuration register 36 of shuffle circuit 30. For this purpose, the internal redirection circuit of input-output circuit IO may be configured to couple the read bit lines of the memory circuit to input nodes of the data input register 32 of shuffle circuit 30 and/or to input nodes of the configuration register 36 of shuffle circuit 30. The result of the shuffle operation is delivered to the output port 34 of shuffle circuit 30, and may be directly written back into the storage array of the memory circuit via input-output circuit IO, without transiting through circuits external to the memory circuit. For this purpose, the internal redirection circuit of input-output circuit IO may be configured to couple the write bit lines of the memory circuit to output nodes of the data output port 34 of shuffle circuit 30. The duration of the actual shuffle operation is constant or substantially constant (the propagation delays may of course slightly vary according to the power, voltage, and/or temperature conditions and, to a lesser extent, according to the respective contents of the registers, depending on the routing circuit architecture). This duration is for example smaller than or equal to one cycle of the clock that times the logic or arithmetic operations performed in the memory circuit.
For this purpose, shuffle circuit 30 is a circuit with a parallel architecture, of configurable routing circuit type. This considerably eases the compilation of a program intended to be executed on an integrated circuit comprising at least one processor and one so-called intelligent memory circuit, of the type described in relation with
In the example of
Each elementary shuffle cell 40 comprises two input nodes e1 and e2, two output nodes s1 and s2, and one configuration node c. According to the state of the signal applied to its configuration node c, each elementary cell 40 is capable of copying onto its output nodes s1 and s2 either the signals respectively applied to its input nodes e1 and e2, or the signals respectively applied to its input nodes e2 and e1.
The K input nodes e1, e2 of the cells 40 of the row of rank l=1, l being an integer ranging from 1 to 2*log2(K)−1, are coupled, preferably connected, respectively to the K storage nodes of the data input register 32 of the shuffle circuit. The K output nodes s1, s2 of the cells 40 of the row of rank l=2*log2(K)−1 are coupled, preferably connected, respectively to the K output nodes of the output port 34 of the shuffle circuit. In this example, configuration register 36 comprises (K/2)*(2*log2(K)−1) storage nodes respectively coupled to the configuration nodes c of the (K/2)*(2*log2(K)−1) cells 40 of the shuffle circuit.
In each row of rank l=1 to l=log2(K)−1, the cells of the row are distributed in groups of n=2^l neighboring cells having their output nodes s1, s2 coupled to the input nodes e1, e2 of the cells of the next row (the row of rank l+1) according to a cross coupling defined as follows:
The array is symmetrical with respect to the central row of rank l=log2(K), that is, in each row of rank l=log2(K) to l=2*log2(K)−2, the cells of the row are distributed in groups of n=2^(2*log2(K)−l−1) neighboring cells having their output nodes s1, s2 coupled to the input nodes e1, e2 of the cells of the next row (the row of rank l+1) according to a cross coupling defined as follows:
The cell 40 of
The input node e1 of cell 40 is connected to the input node a1 of multiplexer mux1 and to the input node a2 of multiplexer mux2. The input node e2 of cell 40 is connected to the input node a2 of multiplexer mux1 and to the input node a1 of multiplexer mux2. The configuration node c of cell 40 is connected to the configuration nodes c1 of multiplexers mux1 and mux2.
According to the state of the signal applied to its configuration node c, cell 40 delivers on its node s1 the signal applied to its node e1 and delivers on its node s2 the signal applied to its node e2, or delivers on its node s1 the signal applied to its node e2 and delivers on its node s2 the signal applied to its node e1.
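The behavior of an elementary shuffle cell built from two multiplexers can be sketched as follows in Python. The wiring of inputs a1 and a2 follows the description above; the names `mux` and `shuffle_cell`, the fact that s1 is taken at the output of mux1 and s2 at the output of mux2, and the convention that a low configuration level selects input a1 (straight path) are assumptions made for the illustration.

```python
def mux(a1, a2, select):
    """Two-input multiplexer: returns a1 when select is 0, a2 when select is 1.

    The polarity of the select signal is an assumption of this sketch.
    """
    return a2 if select else a1


def shuffle_cell(e1, e2, c):
    """Elementary shuffle cell: passes (e1, e2) straight or crossed."""
    # mux1 receives e1 on a1 and e2 on a2; its output is assumed to drive s1.
    s1 = mux(e1, e2, c)
    # mux2 receives e2 on a1 and e1 on a2; its output is assumed to drive s2.
    s2 = mux(e2, e1, c)
    return s1, s2


if __name__ == "__main__":
    print(shuffle_cell("x", "y", 0))  # straight
    print(shuffle_cell("x", "y", 1))  # crossed
```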
In the example of
The K input nodes e1, e2 of the cells 40 of the row of rank l=1 are coupled, preferably connected, respectively to the K storage nodes of the data input register 32 of the shuffle circuit. In this example, configuration register 36 comprises 16 storage nodes respectively coupled to the configuration nodes c of the 16 cells 40 of the shuffle circuit.
As in the example of
Conversely to the example of
An advantage of the shuffle circuit of
More generally, other arrangements of shuffle circuits may be provided, using a number of elementary cells capable of ranging from log2(K!+1) to (K/2)*(2*log2(K)−1).
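To make these cell counts concrete, the following Python sketch evaluates both figures for K=8 and simulates a Benes network using the usual textbook recursive construction. It is only a sketch under stated assumptions: the recursive wiring is not claimed to be identical to the row-by-row coupling of the described shuffle circuit, configuration bits are consumed in recursion order rather than in the order of any configuration register, the lower figure is rounded up to an integer, and all function names are hypothetical.

```python
from math import ceil, factorial, log2


def benes_cell_count(k):
    """(K/2)*(2*log2(K)-1) cells, for K a power of two."""
    return (k // 2) * (2 * (k.bit_length() - 1) - 1)


def lower_cell_count(k):
    """log2(K!+1), rounded up (the rounding is an assumption of this sketch)."""
    return ceil(log2(factorial(k) + 1))


def cell(e1, e2, c):
    """Elementary cell: straight when c is 0, crossed when c is 1 (assumed polarity)."""
    return (e2, e1) if c else (e1, e2)


def benes(data, bits):
    """Route data through a recursively built Benes network.

    bits is an iterator yielding one configuration bit per elementary cell.
    """
    k = len(data)
    if k == 2:
        return list(cell(data[0], data[1], next(bits)))
    upper, lower = [], []
    for i in range(0, k, 2):  # input row of K/2 cells
        s1, s2 = cell(data[i], data[i + 1], next(bits))
        upper.append(s1)
        lower.append(s2)
    upper = benes(upper, bits)  # two half-size sub-networks
    lower = benes(lower, bits)
    out = []
    for u, v in zip(upper, lower):  # output row of K/2 cells
        out.extend(cell(u, v, next(bits)))
    return out


if __name__ == "__main__":
    k = 8
    print(benes_cell_count(k), lower_cell_count(k))
    config = iter([0] * benes_cell_count(k))
    print(benes(list(range(k)), config))
```

For K=8, the sketch gives 20 cells for the Benes arrangement and 16 for the lower figure, the latter matching the 16-cell arrangement described above.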
The memory circuit 300 of
In the example of
In each module MOD of additional memory 50, each of the p rows of the memory module stores a configuration of the corresponding row of shuffle circuit 30, corresponding to one of the K! possible configurations of shuffle circuit 30. A row selection circuit (not detailed in
As compared with the memory circuit of
Additional memory 50 is for example a non-volatile memory, for example, a resistive memory, for example, of RRAM or ReRAM type, formed above a semiconductor substrate, for example, made of silicon, having the main memory and the calculation circuit 20 and the shuffle circuit 30 (Shuffle Block) of the memory circuit formed inside and on top of it. This provides a significant density gain since the storage of the shuffle circuit configuration data then causes no increase in the total surface area of the memory circuit.
bis schematically illustrates an example of a memory circuit of the type described in relation with
As illustrated in view (a) of
As illustrated in view (c) of
As an example, a memory circuit of the type described in relation with
More particularly, the AES standard comprises a step called SubBytes, during which input data, in the form of a vector of a plurality of words of 8 bits each (or bytes), is submitted to a series of a plurality of successive shuffles, selected from a set of a plurality of series of shuffle operations, called Sbox. The different configurations of shuffle circuit 30, corresponding to the different series of shuffle operations or Sbox, may be stored in the memory circuit, in the main memory in the example of
The AES standard further comprises a step called ShiftRows, during which the second, third, and fourth rows of an array of 4×4 bytes are circularly shifted by 1 byte, by two bytes, and by three bytes. To perform this operation, the data of the input array are for example arranged in the form of a 16-byte row vector corresponding to a concatenation of four rows of the array. This vector is applied to the input of shuffle circuit 30. Shuffle circuit 30 is configured to deliver a 16-byte output vector corresponding to a concatenation of the first row of the input array, of the second row of the input array circularly shifted by 1 byte, of the third row of the input array circularly shifted by 2 bytes, and of the fourth row of the input array circularly shifted by three bytes. The corresponding configuration of shuffle circuit 30 may be stored in the memory circuit, in the main memory in the example of
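The ShiftRows step can be expressed as a single fixed shuffle of a 16-byte row vector, as sketched below in Python. The sketch assumes a leftward circular shift, as in the AES standard, and the convention that output position i takes the input byte at position `perm[i]`; the names `shift_rows_permutation` and `apply_shuffle` are hypothetical.

```python
def shift_rows_permutation(size=4):
    """Build the source index of each output byte for ShiftRows.

    The input is the concatenation of the rows of a size x size byte array.
    Row r (counted from 0) is circularly shifted to the left by r bytes.
    """
    return [r * size + (c + r) % size for r in range(size) for c in range(size)]


def apply_shuffle(vector, perm):
    """Return the shuffled vector, with output[i] = vector[perm[i]]."""
    return [vector[src] for src in perm]


if __name__ == "__main__":
    state = list(range(16))  # bytes labeled by their position in the row vector
    shifted = apply_shuffle(state, shift_rows_permutation())
    for r in range(4):
        print(shifted[4 * r:4 * r + 4])
```

Since the permutation does not depend on the data, it corresponds to a single stored configuration of the shuffle circuit.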
The AES standard further comprises a key expansion function. Such a function generates, from an initial 128-bit key, an expanded key formed of 10 round keys of 128 bits each.
bis schematically illustrates the algorithm. View (a) of
As illustrated in view (c) of
The measurements performed by the applicant have shown that by using a memory circuit of the above-described type, a gain by a factor of 12.4 in terms of execution time and a gain by a factor of 5.6 in terms of electric power consumption may be obtained with respect to a conventional implementation, by means of a microprocessor, for example, a processor of Cortex M0+ type.
As a variation, a memory circuit of the type described in relation with
The JPEG compression algorithm particularly comprises a step of color transformation from the RGB (red/green/blue) domain to the YUV (luminance/chrominance) domain. For this purpose, for each pixel of the image, a vector of three values, respectively corresponding to the R, G, and B components of the pixel, is multiplied by an array of 3×3 predetermined coefficients. The result of this multiplication is a vector of three values, respectively corresponding to the Y, U, and V components of the transformed image.
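As a worked illustration of this multiplication, the following Python sketch applies a 3×3 array of coefficients to the (R, G, B) vector of one pixel. The coefficient values are commonly quoted RGB-to-YUV values used here for illustration only; they are an assumption of the sketch, as are the names `COEFFS` and `rgb_to_yuv`.

```python
# Illustrative coefficients (assumed values, not taken from the description).
COEFFS = [
    [0.299, 0.587, 0.114],        # Y row
    [-0.14713, -0.28886, 0.436],  # U row
    [0.615, -0.51499, -0.10001],  # V row
]


def rgb_to_yuv(pixel, coeffs=COEFFS):
    """Multiply the (R, G, B) vector of a pixel by the 3x3 coefficient array."""
    return [sum(c * v for c, v in zip(row, pixel)) for row in coeffs]


if __name__ == "__main__":
    y, u, v = rgb_to_yuv((255, 128, 0))
    print(round(y, 2), round(u, 2), round(v, 2))
```

Each output component is a sum of three products, which is why the transformation maps naturally onto repeated multiply and add operations.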
In the illustration of
In the example of implementation of
The coefficients of the transformation array are stored in the form of a row vector L2 in a second row of the memory circuit.
In this example, the color transformation operation is entirely implemented within the memory circuit, in three cycles each comprising a multiply and add operation. At each cycle, operation L3=L1*L2+L3 is executed. In other words, at each cycle, each of the values L3(q) of vector L3 is incremented by value L1(q)*L2(q), where q is an integer ranging from 1 to 9 designating the position of the data in vectors L1, L2, and L3. Such a multiply and add operation may be implemented by the calculation circuit 20 of the memory circuit. After each multiply and add operation and before the next multiply and add operation, the values of row vector L2 are shuffled to perform a circular shift by three values (rightwards in the orientation of
Thus, within three cycles, the color transformation of three pixels of the image can be achieved. In practice, it is possible to simultaneously process a much higher number of pixels, by increasing the dimensions of vectors L1, L2, and L3, and by repeating a plurality of times in vector L2 the series of transformation coefficients p0, p1, p2, p3, p4, p5, p6, p7, and p8. The three configurations of the shuffle circuit necessary for the implementation of this step may be stored in the memory circuit, in the main memory in the example of
The JPEG compression algorithm further comprises a step of sub-sampling of the image in the YUV format. This step may be implemented by means of shuffle circuit 30. More particularly, shuffle circuit 30 may be used to reorder the data of a memory row containing the complete YUV representation of the image, to only keep a portion of the image, as defined by the JPEG standard. The corresponding configuration of the shuffle circuit may be stored in the memory circuit, in the main memory in the example of
The JPEG compression algorithm further comprises a step of calculating a discrete cosine transform (DCT) of the sub-sampled YUV image. This step may be entirely implemented within the memory circuit, by using calculation circuit 20 and shuffle circuit 30.
The JPEG compression algorithm further comprises a step of entropy coding of blocks of 8*8 values generated at the end of the DCT calculation step. During this step, advantage may be taken of shuffle circuit 30 to reorder the blocks of 8*8 values resulting from the DCT calculation step.
Thus, all or most of the steps of a JPEG compression algorithm may be directly implemented within a memory circuit of the type described in relation with
As a variation, a memory circuit of the type described in relation with
An operation of array multiplication R=A*B is performed, where A and B are arrays of 4 rows and 4 columns defined as follows:
To implement operation R=A*B by means of a memory circuit of the type described in relation with
The transpose BT of array B, defined as follows:
can then be calculated by using the shuffle circuit 30 of the memory circuit.
The row vectors containing array A and array BT may be multiplied point by point by using the calculation circuit 20 of the memory circuit.
The accumulation of the results of sub-products (a00*b00, a01*b10, a02*b20, a03*b30, a10*b01, a11*b11, a12*b21, a13*b31, a20*b02, a21*b12, a22*b22, a23*b32, a30*b03, a31*b13, a32*b23, a33*b33) to calculate 4*4 coefficients of array R may be iteratively performed by the calculation circuit 20 of the memory circuit.
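The sequence of transposition, point-by-point multiplication, and accumulation can be sketched in Python as follows. Arrays are stored as flat row vectors, as in a memory row. The iteration scheme, in which the rows of the transposed array are circularly rotated by a further shuffle at each pass so that each pass yields one coefficient per row of R, is an assumption made for the illustration, and all function names are hypothetical.

```python
def shuffle(vector, perm):
    """Shuffle model: output[i] = vector[perm[i]]."""
    return [vector[src] for src in perm]


def transpose_perm(n):
    """Permutation turning a flattened n x n array into its flattened transpose."""
    return [j * n + i for i in range(n) for j in range(n)]


def row_rotation_perm(n, s):
    """Permutation moving row (i + s) mod n of a flattened array to row i."""
    return [((i + s) % n) * n + j for i in range(n) for j in range(n)]


def multiply(a, b, n):
    """Compute R = A*B for flattened n x n arrays using shuffles and MAC steps."""
    bt = shuffle(b, transpose_perm(n))  # transpose BT of array B
    r = [0] * (n * n)
    for s in range(n):
        bt_s = shuffle(bt, row_rotation_perm(n, s))
        prod = [x * y for x, y in zip(a, bt_s)]  # point-by-point multiplication
        for i in range(n):
            # Accumulating the n sub-products of row i gives coefficient
            # R[i][(i + s) mod n].
            r[i * n + (i + s) % n] = sum(prod[i * n:(i + 1) * n])
    return r


if __name__ == "__main__":
    identity = [1 if i == j else 0 for i in range(4) for j in range(4)]
    a = list(range(1, 17))
    print(multiply(a, identity, 4) == a)
```

With s=0, the sub-products obtained are exactly those listed above (a00*b00, a01*b10, and so on), which produce the diagonal coefficients of R; the other passes fill in the remaining coefficients.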
Thus, all or part of the steps of an artificial neural network inference algorithm may be directly implemented within a memory circuit of the type described in relation with
As an example, by using a memory circuit of the above-described type, two array multiplications may be performed within 6 memory cycles. The number of operations is 32 per cycle and per array (16 additions and 16 multiplications), that is, 64 operations per cycle. In terms of data movements, a memory instruction should be provided for the rotation of the array and four memory instructions should be provided for the implementation of the two array multiplications, that is, 5 memory accesses, to be compared with 64 read accesses for a standard implementation via a microprocessor.
More generally, the above-described example of implementation of an array multiplication operation may be adapted to any application comprising one or a plurality of array multiplication operations.
Various embodiments and variations have been described. It will be understood by those skilled in the art that certain features of these various embodiments and variations may be combined, and other variations will occur to those skilled in the art. In particular, the described embodiments are not limited to the examples of application of the memory circuit mentioned in the present description. Further, the described embodiments are not limited to the examples of implementation of the shuffle circuit described in relation with
Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and the scope of the present invention. Accordingly, the foregoing description is by way of example only and is not intended to be limiting. The present invention is limited only as defined in the following claims and the equivalents thereto.
Number | Date | Country | Kind |
---|---|---|---|
1871578 | Nov 2018 | FR | national |
Number | Name | Date | Kind |
---|---|---|---|
5325500 | Bell et al. | Jun 1994 | A |
9432298 | Smith | Aug 2016 | B1 |
9761300 | Willcock | Sep 2017 | B1 |
9898251 | Dupont De Dinechin et al. | Feb 2018 | B2 |
20020124033 | Takahashi | Sep 2002 | A1 |
20060143428 | Noda | Jun 2006 | A1 |
20090141566 | Arsovski | Jun 2009 | A1 |
20120265951 | Lin | Oct 2012 | A1 |
20130262358 | Heliot | Oct 2013 | A1 |
20160189769 | Jeloka et al. | Jun 2016 | A1 |
20190088307 | Burgess et al. | Mar 2019 | A1 |
20200005859 | Chen | Jan 2020 | A1 |
Number | Date | Country |
---|---|---|
3 252 774 | Dec 2017 | EP |
Entry |
---|
Preliminary Search Report for French Application No. 1871578, dated Sep. 5, 2019. |
[No Author Listed] Intel® Itanium® Architecture Software Developer's Manual: vol. 3: Intel® Itanium® Instruction Set Reference, Revision 2.3. May 2010:420 pages. |
Agarwal et al., An efficient pixel-shuffling based approach to simultaneously perform image compression, encryption and steganography. IJCSMC. May 2013;2(5):376-85. |
Akyel et al., DRC 2: Dynamically reconfigurable computing circuit based on memory architecture. 2016 IEEE International Conference on Rebooting Computing (ICRC). Oct. 17, 2016:1-8. |
Anderson, Linear pixel shuffling for image processing: an introduction. Journal of Electronic Imaging. Apr. 1993;2(2):147-55. |
Beneš, Optimal rearrangeable multistage connecting networks. Bell system technical journal. Jul. 1964;43(4):1641-56. |
Beneš, Permutation groups, complexes, and rearrangeable connecting networks. Bell System Technical Journal. Jul. 1964;43(4):1619-40. |
Brunie et al. Conception d'une matrice reconfigurable pour coprocesseur fortement couplé. Symposium en Architectures nouvelles de machines. Jan. 2013:1-14. |
Charles et al., Smart instruction codes for in-memory computing architectures compatible with standard sram interfaces. Design, Automation and Test in Europe. Mar. 2018:1-7. |
Daemen et al., The design of Rijndael: AES—the advanced encryption standard. Springer Science & Business Media. Mar. 9, 2013. 128 pages. |
Hilewitz et al., Bit matrix multiplication in commodity processors. IEEE 2008 International Conference on Application-Specific Systems, Architectures and Processors. Jul. 2, 2008:7-12. |
Hilewitz, A new basis for shifters in general-purpose processors for existing and advanced bit manipulations. IEEE Transactions on Computers. Aug. 2009;58(8):1035-48. |
Kooli et al., Software platform dedicated for in-memory computing circuit evaluation. Proceedings of the 28th International Symposium on Rapid System Prototyping: Shortening the Path from Specification to Prototype. IEEE ACM. Oct. 2017:43-9. |
Opferman et al., On a class of rearrangeable switching networks part I: Control algorithm. The Bell System Technical Journal. May 1971;50(5):1579-600. |
Wallace, The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics. Feb. 1992;38(1):xviii-xxiv. |
FR1871758, Sep. 5, 2019, Preliminary Search Report. |
Number | Date | Country | |
---|---|---|---|
20200160905 A1 | May 2020 | US |