The present invention relates to memory devices generally and to incorporation of data processing functions in memory devices in particular.
Memory arrays, which store large amounts of data, are known in the art. Over the years, manufacturers and designers have worked to make the arrays physically smaller while increasing the amount of data stored therein.
Computing devices typically have one or more memory arrays to store data and a central processing unit (CPU) and other hardware to process the data. The CPU is typically connected to the memory array via a bus. Unfortunately, while CPU speeds have increased tremendously in recent years, the bus speeds have not increased at an equal pace. Accordingly, the bus connection acts as a bottleneck to increased speed of operation.
U.S. patent application Ser. No. 12/119,197, whose disclosure is incorporated herein by reference and which is owned by the common assignees of the present application, describes a memory device which comprises RAM along with one or more special sections containing associative memory cells. These memory cells may be used to perform parallel computations at high speed. Integrating these associative sections or any other computing ability into the memory device minimizes the resources needed to transfer data into and out of the computation sections, and thus enables the device to perform logical and arithmetic operations on large vectors of bits far faster than is possible in conventional processor architectures.
The associative cells are functionally and structurally similar to CAM cells, in that comparators are built into each associative memory section so as to enable multiple multi-bit data words in the section to be compared simultaneously to a multi-bit comparand. These comparisons are used in the associative memory section as the basis for performing bit-wise operations on the data words.
As explained in the thesis by Akerib, entitled “Associative Real-Time Vision Machine” (Department of Applied Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel, March, 1992), these bit-wise operations serve as the building blocks for a wide range of arithmetic and logical operations, which can thus be performed in parallel over multiple words in the associative memory section.
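By way of illustration only, the compare-and-write primitive described above may be modeled in software as follows; the function names and the two-bit layout are invented for this sketch and are not part of the referenced design. Every word in the associative section is compared, conceptually at once, against a masked comparand, and the resulting tags select which words a subsequent masked write affects. Two such passes compute NOT of one bit column into another across all words in parallel:

    def compare(words, comparand, mask):
        """Tag each word whose masked bits equal the masked comparand."""
        return [int((w & mask) == (comparand & mask)) for w in words]

    def write(words, tags, value, mask):
        """Overwrite the masked bits of every tagged word with 'value'."""
        for i, tagged in enumerate(tags):
            if tagged:
                words[i] = (words[i] & ~mask) | (value & mask)

    # Compute bit1 := NOT(bit0) for all words in parallel, in two passes.
    words = [0b00, 0b01, 0b00, 0b01]
    write(words, compare(words, 0b00, 0b01), 0b10, 0b10)  # bit0 == 0: set bit1
    write(words, compare(words, 0b01, 0b01), 0b00, 0b10)  # bit0 == 1: clear bit1
    assert [w >> 1 for w in words] == [1, 0, 1, 0]

Multi-bit arithmetic, such as vector addition, follows by iterating truth-table passes of this kind over successive bit columns, as described in the above-referenced thesis.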
Reference is now briefly made to
Element 50, however, includes at least one computation region 58, comprising a central slice 60 in which a computation section 64 is sandwiched between the rows of sense amplifiers 62 of the top and bottom arrays. Computation section 64 comprises CAM-like associative cells and tag logic, as explained in U.S. Ser. No. 12/119,197. Data bits stored in the cells of arrays 54 and 56 in region 58 are transferred to computation section 64 via sense amplifiers 62. Computation section 64 then performs any selected parallel processing on the data of the copied row, after which the results are written back into either top array 54 or bottom array 56. This arrangement permits rapid data transfer between the storage and computation sections of region 58 in the memory device.
There is provided, in accordance with a preferred embodiment of the present invention, a memory device including an external device interface, an internal processing element and multiple banks of storage. The external device interface is connectable to an external device communicating with the memory device, and the internal processing element processes data stored on the device. Each bank includes a plurality of storage units, and each storage unit has two ports: an external port connectable to the external device interface and an internal port connected to the internal processing element.
Moreover, in accordance with a preferred embodiment of the present invention, the plurality of storage units is formed into an upper row of units and a lower row of units, and the memory device also includes a computation belt between the upper and lower rows, wherein the internal port and the processing element are located within the computation belt.
Additionally, in accordance with a preferred embodiment of the present invention, the computation belt includes an internal bus to transfer the data from the internal port to the processing element.
Further, in accordance with a preferred embodiment of the present invention, the internal bus is a reordering bus to reorder the output of the internal port to match a pre-storage logical order of the data.
Still further, in accordance with a preferred embodiment of the present invention, the reordering bus includes four lines each to provide bytes from one of the internal ports to every fourth byte storage unit of the processing element.
Additionally, in accordance with a preferred embodiment of the present invention, each line connects between one internal port and the processing element.
Further, in accordance with a preferred embodiment of the present invention, two of the lines connect between one internal port and the processing element.
Moreover, in accordance with a preferred embodiment of the present invention, the internal port includes a plurality of sense amplifiers and a buffer to store the output of the sense amplifiers.
Further, in accordance with a preferred embodiment of the present invention, the banks of storage include one of the following types of memory: DRAM memory, 3T DRAM, SRAM memory, ZRAM memory and Flash memory.
Additionally, in accordance with a preferred embodiment of the present invention, the processing element includes 3T DRAM elements.
Moreover, in accordance with a preferred embodiment of the present invention, the processing element also includes sensing circuitry to sense a Boolean function of at least two activated rows of the 3T DRAM elements.
Further, in accordance with a preferred embodiment of the present invention, the processing element includes a shift operator.
There is also provided, in accordance with a preferred embodiment of the present invention, a memory device including a plurality of storage banks and a computation belt. The plurality of storage banks store data and are formed into an upper row of units and a lower row of units. The computation belt is located between the upper and lower rows and performs on-chip processing of data from the storage units.
Moreover, in accordance with a preferred embodiment of the present invention, each bank includes a plurality of storage units and each storage unit has an internal port forming part of the computation belt.
Additionally, in accordance with a preferred embodiment of the present invention, the computation belt includes a processing element.
Further, in accordance with a preferred embodiment of the present invention, the computation belt includes an internal bus to transfer the data from the internal ports to the processing element.
There is also provided, in accordance with a preferred embodiment of the present invention, a memory device including a plurality of storage units and a within-device reordering unit. The plurality of storage units store data of a bank, wherein the data has a logical order prior to storage and a physical order different than the logical order within the plurality of storage units. The within-device reordering unit reorders the data of a bank into the logical order prior to performing on-chip processing.
Moreover, in accordance with a preferred embodiment of the present invention, the storage units are formed of DRAM memory units.
Further, in accordance with a preferred embodiment of the present invention, the reordering unit includes a plurality of sense amplifiers, each to read data of its associated storage unit, and a data transfer unit to reorder the output of the sense amplifiers to match the logical order of the data.
Still further, in accordance with a preferred embodiment of the present invention, N storage units spread across the memory device form a bank to which an external device writes data and the data transfer unit operates to provide data of one bank to an on-chip processing element.
Additionally, in accordance with a preferred embodiment of the present invention, the data transfer unit includes an internal bus and at least one compute engine controller at least to indicate to the internal bus how to place data from each of the plurality of the sense amplifiers associated with storage units of one of the banks into the processing element.
Moreover, in accordance with a preferred embodiment of the present invention, the internal bus includes N lines each to transfer a unit of data between the sense amplifiers of one storage unit and every Nth data location of the processing element, wherein the lines together connect to all data locations of the processing element.
Alternatively, in accordance with a preferred embodiment of the present invention, the internal bus includes N lines each to transfer a unit of data between the sense amplifiers and every Nth data location of the processing element, wherein two of the lines transfer from one storage unit and two of the lines transfer from a second storage unit.
Moreover, in accordance with a preferred embodiment of the present invention, the at least one compute engine controller indicates to the internal bus where to begin placement or removal of the data.
Further, in accordance with a preferred embodiment of the present invention, the processing element includes a 3T DRAM array, sensing circuitry for sensing the output when multiple rows of the 3T DRAM array are generally simultaneously activated and a write unit to write the output back to the 3T DRAM array.
Still further, in accordance with a preferred embodiment of the present invention, the memory device includes a 3T DRAM array and the reordering unit writes back to the 3T DRAM array for processing.
There is still further provided, in accordance with a preferred embodiment of the present invention, a method of performing parallel processing on a memory device. The method includes, on the device, performing neighborhood operations on data stored in a plurality of storage units of a bank, even though the data has a logical order prior to storage and a physical order different than the logical order within the plurality of storage units.
Moreover, in accordance with a preferred embodiment of the present invention, the performing includes accessing data from the plurality of storage units, reordering the data into its logical order and performing neighborhood operations on the reordered data.
Finally, in accordance with a preferred embodiment of the present invention, the neighborhood operations form part of image processing operations.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
Many memory units, such as DRAMs and others, are not committed to maintaining the original, “logical” order of the data (i.e. the order in which the data is provided to the memory unit). Instead, many memory units change the logical order to a “physical” order when storing the data among the multiple storage elements of the memory unit, at least in part for efficiency. The memory units restore the logical order upon reading the data out.
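This behavior may be illustrated with a toy software model; the permutation below is invented for illustration and does not correspond to any particular device. The scrambling applied on write is exactly undone on an ordinary external read, which is why external devices never observe the physical order:

    PERM = [2, 0, 3, 1]  # hypothetical mapping: logical slot to physical slot

    def store(logical):
        """Scatter logically ordered bytes into a scrambled 'physical' order."""
        physical = [None] * len(logical)
        for i, byte in enumerate(logical):
            physical[4 * (i // 4) + PERM[i % 4]] = byte
        return physical

    def external_read(physical):
        """Gather the bytes back into logical order, as a normal read does."""
        return [physical[4 * (i // 4) + PERM[i % 4]] for i in range(len(physical))]

    data = list(b"ABCDEFGH")
    assert external_read(store(data)) == data  # the round trip hides the reordering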
Reference is now made to
As illustrated in
As shown in
Memory array 102 is shown divided into four regions 110, where each region 110 may be divided into multiple quads 112.
Running along the horizontal middle of memory array 102 is a horizontal belt 114 and running along the vertical middle of memory array 102 is a spine 116. Belt 114 and spine 116 may be used to run power and control lines to the various elements of memory array 102.
When data is to be read from a specific row in a specific section 120, address decoder 104 may activate the relevant row, and the data may be read out, via main bus (MDQ) 126, to main sense amplifiers 106.
Reference is now made to
memory device 202. Like memory array 102, memory device 202 may comprise quads 112 and main sense amplifiers 106; however, its central horizontal belt, here compute engine (CE) belt 214, may additionally contain internal processing elements, as described hereinbelow.
Mirror main sense amplifiers 220 may be located on the side of each quad 112 close to CE belt 214, connected to the same main bus (MDQ) 126 as main sense amplifiers 106.
Mirror main sense amplifiers 220 may operate in the same way as main sense amplifiers 106. However, mirror main sense amplifiers 220 may connect their quads 112 to the internal processing elements of CE belt 214 via internal bus 225, while main sense amplifiers 106 may connect their quads to external processing elements, such as external device 10.
Mirror main sense amplifiers 220 may be controlled by similar but parallel logic to that which controls main sense amplifiers 106. They may work in lock-step with main sense amplifiers 106, such that data may be copied to both main sense amplifiers 106 and mirror main sense amplifiers 220 at similar times, or they may work independently.
There may be the same number of mirror main sense amplifiers 220 per quad as main sense amplifiers 106 or a simple multiple of the number of main sense amplifiers 106. Thus, if there are 32 main sense amplifiers 106 per quad 112, there may be 32, 64 or 128 mirror main sense amplifiers 220 per quad 112.
Unlike main sense amplifiers 106, which may all be connected to an output bus (not shown), each set of mirror main sense amplifiers 220 per quad 112 may be connected to an associated buffer 221, which may hold the data until processing element 224 requires it. Thus, mirror main sense amplifiers 220 may enable accessing all quads in all banks, in parallel, if desired. Such is not possible with main sense amplifiers 106, which all provide their output directly to the same output bus and, accordingly, cannot work at the same time. Moreover, buffers 221 may enable memory device 202 to have a similar timing to that of a memory array in a standard DRAM.
Mirror main sense amplifiers 220 may be connected to processing element 224 via internal bus 225, which may be a standard bus or a rearranging bus, as described in more detail hereinbelow. Internal bus 225 may be M bits wide, where M may be a function of the number of mirror main sense amplifiers 220 per quad 112. For example, M may be 64 or 128.
Processing element 224 may be any suitable processing or comparison element. For example, processing element 224 may be a massively parallel processing element, such as any of the processing elements described in US patent publications 2009/0254694 and 2009/0254697 and in U.S. patent application Ser. Nos. 12/503,916 and 12/464,937, all owned by the common assignee of the present invention and all incorporated herein by reference.
Processing element 224 may be formed of CAM cells, of 3T DRAM cells or of any other suitable type of cell. It may perform a calculation or a Boolean operation. The latter is described in U.S. Ser. No. 12/503,916, filed Jul. 16, 2009, owned by the common assignee of the present invention and incorporated herein by reference, and requires relatively few rows in processing element 224. This is discussed further hereinbelow.
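For illustration, the multi-row Boolean operation may be modeled behaviorally as follows; this is a sketch under the assumption that a precharged read bitline, discharged by any activated 3T DRAM cell storing a 1, senses the bitwise NOR of the activated rows, and it is not the circuit of the above-referenced applications. Writing the sensed row back yields a functionally complete primitive:

    def multi_row_read(array, rows):
        """Sense the bitwise NOR of the selected rows of a 0/1 matrix."""
        ncols = len(array[0])
        return [int(not any(array[r][c] for r in rows)) for c in range(ncols)]

    array = [
        [0, 1, 0, 1],  # row 0
        [0, 0, 1, 1],  # row 1
        [0, 0, 0, 0],  # row 2, used as scratch
    ]
    array[2] = multi_row_read(array, [0, 1])  # row 2 := NOR(row 0, row 1)
    assert array[2] == [1, 0, 0, 0]
    # NOR(x, x) = NOT(x), and OR/AND follow, so any Boolean function can be composed.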
Processing element 224 may be controlled by compute engine controllers (CECs) 226 which may, in turn, be controlled by microcontroller (MCU) 228. If microcontroller 228 runs at a lower frequency than processing element 224, multiple compute engine controllers 226 may be required.
It may be appreciated that, by placing mirror main sense amplifiers 220 close to processing element 224, there may be a minimum of additional wiring to bring data to processing element 224. Furthermore, by placing all of the internal processing elements (i.e. mirror main sense amplifiers 220, buffers 221, processing element 224, internal bus 225, compute engine controllers 226 and microcontroller 228) within CE belt 214 (rather than in separate computation sections 64 as previously discussed), the present invention may incur a relatively small increase in the real estate of a standard DRAM, while providing a significant increase in its functionality.
Applicants have realized that the physical disordering of the data from its original, logical order upon storage makes the massively parallel processing of computation section 64 difficult.
External device 10 may provide data, in its logical order, to DRAM 100, which then stores the data in memory array 102. However, address decoder 104 and the other elements (not shown) involved in writing to memory array 102 may allocate neighboring logical addresses to storage locations that are not next to each other in the array. Two examples of this follow.
Address decoder 104 may divide each 32-bit word into four 8-bit bytes, labeled “a”, “b”, “c” and “d”, and, in the first example, may store each byte in a different quad 112, such that the four bytes of each word are spread across four quads.
In an alternative example, address decoder 104 may store two of the four bytes of each word in one quad 112 and the other two bytes in a second quad 112.
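The two interleavings just described may be sketched as follows; the quad assignments are hypothetical and serve only to make the examples concrete:

    def byte_streams(words):
        """Split 32-bit words into byte streams a, b, c, d (most significant first)."""
        return [[(w >> shift) & 0xFF for w in words] for shift in (24, 16, 8, 0)]

    words = [0x61626364, 0x31323334]  # "abcd", "1234"
    a, b, c, d = byte_streams(words)

    four_quads = [a, b, c, d]    # first example: one byte stream per quad 112
    two_quads = [a + b, c + d]   # alternative example: two byte streams per quad 112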
Neither situation presents a problem for external access to the data, since external device 10 is not aware of how memory array 102 internally stores the data. Address decoder 104 is responsible for translating the address request of the external element to the actual storage location within memory array 102, and the data which is read out is reordered before it arrives back at external device 10.
Address decoder 104 is also responsible for another address request translation: mapping requests around any bad columns in the array, such that main sense amplifiers 106 receive the correct data irrespective of such columns.
In U.S. patent application Ser. No. 12/119,197, the data is sequential and is copied from one row of memory into a row in computation section 64.
For example, image processing operates by performing neighborhood operations on the pixels around a central pixel. A typical operation of this sort may be a blurring operation or the finding of an edge of an object in the image. These operations typically utilize discrete cosine transforms (DCTs), convolutions, etc. In DRAM 100, however, neighboring pixels may be stored far away from each other.
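A small illustrative sketch (data and disordering invented) makes the difficulty concrete: a three-tap blur averages each pixel with its logical neighbors, so running it directly on a physically scrambled row averages unrelated pixels:

    def blur3(pixels):
        """Simple 1-D neighborhood operation: average of left, self and right."""
        out = list(pixels)
        for i in range(1, len(pixels) - 1):
            out[i] = (pixels[i - 1] + pixels[i] + pixels[i + 1]) // 3
        return out

    logical = [10, 20, 30, 40, 50, 60, 70, 80]
    physical = [30, 10, 40, 20, 70, 50, 80, 60]  # hypothetical physical disordering

    assert blur3(logical)[3] == (30 + 40 + 50) // 3  # correct logical neighbors
    assert blur3(physical) != blur3(logical)         # wrong when run on the raw layout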
Similarly, many parallel processing paradigms, whether of U.S. Ser. No. 12/119,197 or some other paradigm, cannot rely on copying the data out of memory array 102 one row at a time.
In accordance with a preferred embodiment of the present invention, by placing the internal processing elements in computation belt 214, rather than within each computation section 64 (which typically is located within section 120 of quad 112), the mapping operation of address decoder 104, which ensures that main sense amplifiers 106 receive the correct data, irrespective of any bad columns, may be utilized. Thus, mirror main sense amplifiers 220 may also receive the correct data.
Furthermore, in accordance with a preferred embodiment of the present invention, internal bus 225 may be a rearranging bus to compensate for the physical disordering across quads 112, by bringing data from all of the quads 112 to processing element 224. The particular structure of internal bus 225 may be a function of the kind of disordering performed by the DRAM, whether that of the first example described hereinabove or that of the alternative example.
It will be appreciated that internal bus 225 may reorder the data to bring it back to its original, logical, order, such that processing element 224 may perform parallel processing thereon, as described hereinbelow.
Reference is now made to
MCU 228 may instruct internal bus 225 to bring M bytes of a row from each quad 112 of one bank at each cycle.
In the example of
In this manner, internal bus 225 may bring the separated bytes next to each other in processing element 224. The number of bits read in a cycle may vary. For example, 128 bits may be read each cycle, with each read coming entirely from one quad. Alternatively, 64 bits or 128 bits may be read from two quads in one cycle. It will be understood that internal bus 225 may bring any desired amount of data during a cycle.
Internal bus 225 may bring the data of a single bank to processing element 224, thereby countering the disorder of a single bank. However, this may be insufficient, particularly for performing neighborhood operations on the data at one end or the other of a bank.
It will be appreciated that processing element 224 may have multiple rows therein and that MCU 228 may indicate to internal bus 225 to place the data in any appropriate row of processing element 224. This may be particularly useful for neighborhood operations and/or for operations performed on multiple rows of data.
In another embodiment, MCU 228 may instruct internal bus 225 to place the data of subsequent cycles to a row of processing element 224 directly below the data from a previous cycle.
It will be appreciated that the combination of internal bus 225 (a hardware element) and MCU 228 with compute engine controllers 226 (under software control) may enable any rearrangement of the data. Thus, if each bank of the DRAM is divided into N storage units (where, in the example shown hereinabove, there were 4 storage units, called quads), MCU 228 may instruct internal bus 225 to drop the bytes of each storage unit at every Nth byte storage unit 230 of processing element 224.
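The strided placement may be sketched as follows; the function and data layout are hypothetical. Each of the N lines deposits the bytes it carries from one storage unit at every Nth byte position of a processing-element row, beginning at an offset chosen by the compute engine controller:

    def place(row, line_bytes, line_index, n_units, start=0):
        """Drop line_bytes at positions start+line_index, then every n_units after."""
        for k, byte in enumerate(line_bytes):
            row[start + line_index + k * n_units] = byte

    N = 4
    quads = [[0x61, 0x65], [0x62, 0x66], [0x63, 0x67], [0x64, 0x68]]  # bytes per quad
    row = [0] * (N * 2)
    for q in range(N):
        place(row, quads[q], q, N)

    assert bytes(row) == b"abcdefgh"  # logical order restored in the processing row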
In an alternative embodiment, internal bus 225 may bring the data directly to processing element 224, rather than dropping the data every Nth section.
In one embodiment, processing element 224 may comprise storage rows, storing the data as described hereinabove, and processing rows, in which the computations may occur. Any appropriate processing may occur. Processing element 224 may perform the same operation on each row or set of rows, thereby providing a massively parallel processing operation within memory device 202.

In another embodiment, memory array 102 is not a DRAM but any other type of memory array, such as SRAM (static RAM), Flash, ZRAM (zero-capacitor RAM), etc.

It will be appreciated that the above discussion provided the data to processing element 224. Each of the elements may also operate in reverse. Thus, internal bus 225 may take the data of a row of processing element 224, for example, after processing of the row has finished, and may provide it to mirror main sense amplifiers 220, which, in turn, may write the bytes to the separate quads 112, according to the physical order.
In an alternative embodiment, CE belt 214 may not include mirror main sense amplifiers 220 and may, instead, utilize main sense amplifiers 106.
In a further embodiment, processing element 224 may comprise a shift operator 250.
Shift operator 250 may comprise two sets 252-1 and 252-2 of left-direction gates and two sets 254-1 and 254-2 of right-direction gates, connected between a row 224-1 and a row 224-2 of processing element 224, as well as a set 256 of shift transistors.
Shift operator 250 may additionally comprise select lines for each set of transistors: a “shift_left” to control both sets 252-1 and 252-2, a “shift_right” to control both sets 254-1 and 254-2, and a “shift_1” to control set 256.
To shift a row of data elements to the right, for example, to shift data elements from location A1 to location A2, location A2 to location A3, etc., CEC 226 may activate select line shift_right, to activate both sets of right-direction gates 254-1 and 254-2, and select line shift_1, to shift the data by one data element.
If desired, shift operator 250 may also include other shift transistors between the sets of direction shifting gates 252 and 254, to shift the data more than one location to the right or to the left. These shift transistors may be selectable, such that shift operator 250 may shift by a different number of data elements each time it is activated.
It will be appreciated that shift operator 250 also includes a direct path 258 from each element (e.g. A1) of row 224-1 to its corresponding element (e.g. A1) of row 224-2, for operations which do not require a shift.
It will be appreciated that shift operator 250 may provide a parallel shift operation, since it operates on an entire row at once.
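The shift may be modeled behaviorally as follows; this is a sketch of the row-at-once behavior, not the transistor-level design. Data moves from one processing-element row to the next, either straight down (the direct path) or displaced by a selectable number of positions:

    def shift_row(src, direction=0, amount=0, fill=0):
        """direction: +1 right, -1 left, 0 direct path; the whole row moves at once."""
        if direction == 0 or amount == 0:
            return list(src)                        # direct path 258, no shift
        if direction > 0:
            return [fill] * amount + src[:-amount]  # shift right by 'amount'
        return src[amount:] + [fill] * amount       # shift left by 'amount'

    row_1 = [1, 2, 3, 4, 5]
    assert shift_row(row_1, +1, 1) == [0, 1, 2, 3, 4]  # A1 -> A2, A2 -> A3, ...
    assert shift_row(row_1) == row_1                   # direct path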
Reference is now briefly made to
In the embodiment of
Memory device 202 may also be formed from a 3T DRAM memory array. In this embodiment, the memory array may have two sections, one storing the physically disordered data, and one for in-memory processing. Internal bus 225 may take the disordered data, reorder it and rewrite it back to the in-memory processing section.
In an alternative embodiment, the memory array may have only one section. The data may initially be written into it in a disordered way. Whenever a row or a section of data is to be processed, the row may be read out, reordered by internal bus 225 and then written back, in order, into the row or section. Memory device 202 may then process the reordered data, in place, as discussed in U.S. Ser. No. 12/503,916.
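The single-section flow may be sketched as follows; the permutation and data are invented for illustration. A disordered row is read out, restored to logical order and rewritten in place, after which it is ready for in-memory processing:

    def reorder(physical, perm):
        """Undo a per-position permutation: logical[i] = physical[perm[i]]."""
        return [physical[perm[i]] for i in range(len(perm))]

    perm = [1, 3, 0, 2, 5, 7, 4, 6]         # hypothetical disordering of one row
    row = [30, 10, 40, 20, 70, 50, 80, 60]  # the row as stored, in physical order

    row[:] = reorder(row, perm)             # read out, reorder, write back in place
    assert row == [10, 20, 30, 40, 50, 60, 70, 80]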
It will be appreciated that the present invention may provide in-memory parallel processing for any memory array which may have a different physical order for storage than the original, logical order of the data. The present invention provides decoding, by reading with mirror main sense amplifiers 220; rearranging, via internal bus 225; and configuration, via compute engine controllers 226, to control where bus 225 places the data in processing element 224. This simple mechanism may undo any disordering of the data and thus may enable parallel processing, particularly the performance of neighborhood operations.
As discussed hereinabove, some of the neighborhood operations may include shift operations. Thus, memory device 202 may be able to perform a logical or mathematical computation on neighborhood data in its logical order after which the results may be shifted to the right or left and the shifted result returned for storage in its physical order.
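That sequence may be sketched end to end as follows; the permutation, data and the toy two-tap operation are invented for illustration:

    PERM = [1, 3, 0, 2]  # hypothetical: logical position i is stored at PERM[i]

    def to_logical(phys):
        return [phys[PERM[i]] for i in range(len(PERM))]

    def to_physical(logical):
        phys = [None] * len(logical)
        for i, value in enumerate(logical):
            phys[PERM[i]] = value
        return phys

    stored = [30, 10, 40, 20]                 # row as held, in physical order
    row = to_logical(stored)                  # [10, 20, 30, 40]: logical order
    row = [max(x, y) for x, y in zip(row, row[1:])] + [row[-1]]  # toy neighborhood op
    row = [0] + row[:-1]                      # shift the results right by one
    stored = to_physical(row)                 # return to physical order for storage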
While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
This application claims priority benefit from U.S. Provisional Patent Application No. 61/253,563, filed Oct. 21, 2009, which is hereby incorporated in its entirety by reference.
Filing Document | Filing Date | Country | Kind | 371(c) Date
PCT/IB10/54526 | 10/6/2010 | WO | 00 | 6/19/2012

Number | Date | Country
61/253,563 | Oct. 2009 | US