1. Field of the Invention
The present invention relates to a computing system and, more particularly, to a computing system that uses computing processors residing in data storage devices to process data in a highly parallel fashion.
2. Description of the Related Art
A computing system generally includes a Central Processing Unit (CPU), a cache, a main memory, a chip set, and a peripheral. The computing system normally receives data input from the peripheral and supplies the data to the CPU where the data is to be processed. The processed data can then be stored back to the peripheral. The CPU can, for example, be an Arithmetic Logic Unit (ALU), a floating-point processor, a Single-Instruction-Multiple-Data (SIMD) unit, or a special functional unit. The peripheral can be a memory peripheral, such as a hard disk drive or any nonvolatile device that provides mass data storage, or an I/O peripheral device, such as a printer or graphics sub-system, that provides I/O capabilities. The main memory provides less data storage than the hard drive peripheral but at a faster access time. The cache provides even less data storage than the main memory, but at a much faster access time. The chip set contains supporting chips for the computing system and, in effect, expands the CPU's small number of I/O pins so that the CPU can communicate with many peripherals.
An execution model describes how a computing system carries out its computations. A conventional scalar computer, for example, executes one operation at a time, whereas a vector computer operates on entire arrays of data with a single instruction.
A vector computer is able to exploit data parallelism to speed up those special applications that can be vectorized. However, vector computers replicate many expensive hardware components, such as vector CPUs and vector register files, to achieve high performance, and they require very high data bandwidth to keep the vector CPUs supplied. The end result is a very expensive, bulky and power-hungry computing system.
In recent years, logic has been embedded into memories to provide special purpose computing systems that perform specific processing. Memories that include processing capabilities are sometimes referred to as “smart memory” or intelligent RAM. Research on embedding logic into memories has led to several technical publications, namely: (1) Duncan G. Elliott, “Computational RAM: A Memory-SIMD Hybrid and its Application to DSP,” Custom Integrated Circuits Conference, Session 30.6, 1992, which describes a memory chip integrating bit-serial processors without any system architecture considerations; (2) Andreas Schilling et al., “Texram: A Smart Memory for Texturing,” Proceedings of the Sixth International Symposium on High Performance Computer Architecture, IEEE, 1996, which describes a special purpose smart memory for texture mapping used in a graphics subsystem; (3) Stylianos Perissakis et al., “Scalable Processors to 1 Billion Transistors and Beyond: IRAM,” IEEE Computer, September 1997, pp. 75-78, which is simply a highly integrated version of a vector computer without any enhancement at the architecture level; (4) Mark Horowitz et al., “Smart Memories: A Modular Configurable Architecture,” International Symposium on Computer Architecture, June 2000, which describes a project that tries to integrate general purpose multi-processors and multi-threads on the same integrated circuit chip; and (5) Lewis Tucker, “Architecture and Applications of the Connection Machine,” IEEE Computer, 1988, pp. 26-38, which describes a massively parallel machine built from many distributed processors, memories, and the routers among them. The granularity of the memory size, the bit-serial processors, and the I/O capability in these approaches is so fine that the processors end up spending more time communicating than processing data.
Accordingly, there is a need for computing systems with improved efficiency and reduced costs as compared to conventional vector computers.
The invention pertains to a smart memory computing system that uses smart memory for massive data storage as well as for massive parallel execution. The data stored in the smart memory can be accessed just like the conventional main memory, but the smart memory also has many execution units to process data in situ. The smart memory computing system offers improved performance and reduced costs for those programs having massive data-level parallelism. This invention is able to take advantage of data-level parallelism to improve execution speed by, for example, use of inventive aspects such as algorithm mapping, compiler techniques, architecture features, and specialized instruction sets.
The invention can be implemented in numerous ways, including as a method, system, device, data structure, or computer readable medium. Several embodiments of the invention are discussed below.
As a smart memory computing system, one embodiment of the invention includes at least: a user space wherein data within has data-level parallelism; a smart memory space wherein data within can be processed in parallel and in situ; a graphical user representation describing data in said user space and interactions therewith; and a compiler mapping data from said user space to said smart memory space and generating executable codes in accordance with the graphical user representation.
As a data structure for storing data, variables and attributes for use in a smart memory computing system, one embodiment of the invention includes at least: a previous data field that stores a prior data value; a current data field that stores a current data value; a variable field that stores at least one fixed variable; and an attributes field. The attributes field stores a plurality of attributes, such as a filler field, a pass field, and a coefficient field. Other aspects and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the invention.
The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:
a) shows the detailed diagram of the EX, GER, and LER.
a) shows the detailed diagram of the EX, GER, and LER with tag bits in the pipeline registers.
b) shows a portion of parallel code to illustrate how the instructions work with architecture features.
a) shows a graphical user representation for Smart Memory Computing.
b) shows a block diagram of how a smart memory compiler works, according to one embodiment.
a), (b) show a parallel ADD instruction without any qualifier.
c) shows a parallel ADD instruction with qualifier “W” for register file access in wrap-around fashion.
d) shows a parallel ADD instruction with qualifier “L” using a source operand from the leftmost register file for all execution units.
e) shows a parallel ADD instruction with qualifier “R” using a source operand from the rightmost register file for all execution units.
f) shows a parallel ADD instruction with qualifier “C” to concatenate the source operands from all register files into one.
a) shows a LOAD/STORE instruction using an index register as the memory address offset.
b) shows a LOAD/STORE instruction using an immediate field as the memory address offset.
a) shows a pre-determined timing delay in data movement when executing a MOV instruction.
b) shows an event trigger in data movement when executing a MOV instruction.
The invention pertains to a smart memory computing system that uses smart memory for massive data storage as well as for massive parallel execution. The data stored in the smart memory can be accessed just like the conventional main memory, but the smart memory also has many execution units to process data in situ. The smart memory computing system offers improved performance and reduced costs for those programs having massive data-level parallelism. This invention is able to take advantage of data-level parallelism to improve execution speed by, for example, use of inventive aspects such as algorithm mapping, compiler techniques, architecture features, and specialized instruction sets.
I. Algorithm Mapping
Smart memory computing is most suitable for solving problems with massive fine-grained data-level parallelism. Two- or three-dimensional computational fluid dynamics, electromagnetic fields, encryption/decryption, image processing, and sparse matrices are some examples.
$\frac{(\Phi_{i+1,j}+\Phi_{i-1,j}+\Phi_{i,j+1}+\Phi_{i,j-1})-4\Phi_{i,j}}{h^2}=\frac{\rho_{i,j}}{\epsilon}, \quad i=1,\ldots,N-1, \;\; j=1,\ldots,M-1$

or

$\Phi_{i,j}=\frac{\Phi_{i+1,j}+\Phi_{i-1,j}+\Phi_{i,j+1}+\Phi_{i,j-1}}{4}-\frac{h^2}{4}\,\frac{\rho_{i,j}}{\epsilon}, \quad i=1,\ldots,N-1, \;\; j=1,\ldots,M-1$ (Eq. 1.1),

where $h=W/N=L/M$.
The simultaneous equations of Eq. 1.1 can be readily solved by iteration. After applying an initial value to each data point, the new $\Phi_{i,j}$ can be readily calculated based on the values of its four neighbors, the charge density $\rho_{i,j}$, and the permittivity of the dielectric $\epsilon$. This procedure is iterated until the maximum difference between two successive iterations is less than the allowable tolerance. Hence, smart memory computing can process multiple calculations simultaneously to yield high speed operation. In contrast, a conventional scalar computer only processes one calculation at a time and, therefore, is considerably slower.
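As an illustrative sketch only, the iteration of Eq. 1.1 can be expressed in conventional serial code as follows; the mesh size, unit-square dimensions (W = L = 1 is assumed, so h = 1/N), tolerance, and charge distribution are assumed values, not part of the original disclosure. A smart memory system evaluates all interior points of this update in parallel, which is where the speedup arises.

```python
import numpy as np

# Jacobi iteration for Eq. 1.1. The mesh size, tolerance, and charge
# distribution below are illustrative assumptions.
N = M = 64
h = 1.0 / N                        # mesh spacing, h = W/N = L/M
eps = 1.0                          # permittivity
rho = np.zeros((N + 1, M + 1))     # charge density
rho[N // 2, M // 2] = 1.0          # a single point charge

phi = np.zeros((N + 1, M + 1))     # boundary values stay fixed at 0
tol = 1e-6
for _ in range(100_000):           # guard against non-termination
    new = phi.copy()
    # Each new interior point is the average of its four neighbors
    # minus (h^2/4) * rho/eps, exactly as in Eq. 1.1.
    new[1:-1, 1:-1] = (phi[2:, 1:-1] + phi[:-2, 1:-1] +
                       phi[1:-1, 2:] + phi[1:-1, :-2]) / 4.0 \
                      - (h * h / 4.0) * rho[1:-1, 1:-1] / eps
    if np.max(np.abs(new - phi)) < tol:   # convergence check
        phi = new
        break
    phi = new
```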
Due to advances in integrated circuit technology, semiconductor devices can be fabricated very inexpensively and abundantly in monolithic silicon. Conceptually, the invention can designate one processor for each data point in a mesh so that executions can be carried out in parallel. In one implementation, each data point, stored in the Smart Memory Integrated Circuits (SMICs), is cycled through an execution unit and stored back into Data Memory Blocks (DMBs) after being processed. The data points in the mesh are all processed in parallel, limited only by the available execution units. In this way, the number of execution units in hardware can be reduced and yet the benefit of massive parallelism can be achieved.
After the initial values are applied to all the data points, the two sections of data 19-0 and 19-1 are mapped into SMICs 18-0 and 18-1 for processing. Then, the next two sections of data 19-2 and 19-3 are brought into the SMICs 18-0 and 18-1 to process. This procedure can repeat until all sections in 19 are exhausted. The next iteration then begins, and iterations continue until the maximum difference between two successive iterations is within the tolerance.
If the data transfer between SMICs and the user space is very slow, another approach is to perform more iterations on the resident sections, until the tolerance is met, before bringing in the next two sections of data from the user space. In one embodiment, there are data updates between SMICs in each iteration. For example, the updated data in 29-5 would be copied to Data Memory Block 39-1 before the next iteration can begin.
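The section-by-section mapping and the inter-SMIC boundary update can be sketched as follows; the two-SMIC split, the halo column, and all names are illustrative assumptions. The copy of the shared column models the update of 29-5 into Data Memory Block 39-1 described above.

```python
import numpy as np

def jacobi_step(section):
    """One relaxation step on a section resident in one SMIC."""
    new = section.copy()
    new[1:-1, 1:-1] = (section[2:, 1:-1] + section[:-2, 1:-1] +
                       section[1:-1, 2:] + section[1:-1, :-2]) / 4.0
    return new

mesh = np.random.rand(32, 64)      # user-space data, assumed values
# Each section carries one extra "halo" column owned by its neighbor.
left = mesh[:, :33].copy()         # columns 0..32; column 32 is the halo
right = mesh[:, 31:].copy()        # columns 31..63; column 0 is the halo

for _ in range(100):               # iterate to tolerance in practice
    left, right = jacobi_step(left), jacobi_step(right)
    left[:, -1] = right[:, 1]      # neighbor's updated boundary -> halo
    right[:, 0] = left[:, -2]      # and in the other direction
```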
Smart memory computing according to the invention can be further illustrated by a practical example.
A solution to the example can be determined in multiple passes. The first pass is to solve Poisson's equation for the dielectric in the regions, and the second pass is to calculate the continuity equation on the boundary 219-2. The area outside of the boundary 219-1 can be filled with fillers so that the parallel execution can be carried on in one pass. The filler can be, for example, a NaN, Not a Number, as in the IEEE 754 standard. Any operations involving NaNs still produce NaNs. Alternatively, an attribute bit can be set to indicate that a data point is a filler. In the convergence check, when comparing two NaNs, the result can be set to zero by use of a SUBL instruction, SUBtract Literally, so that the magnitude comparison can be carried out in parallel even for fillers. The fixed-value boundary condition can be treated the same way by use of an attribute bit for “fixed value” before any smart memory execution begins. Any operation to generate a new result for those data points marked “fixed value” will be nullified. To consider two permittivities in the same parallel instruction, each execution unit has coefficient bits to select global coefficient registers (COEFREG) as the source operands. For example, COEFREG[1]=1 and COEFREG[2]=3.9 are preloaded with the two permittivities. The coefficient attributes for data points in the execution units 3 and 4 are set to 1 and 2, respectively. When one source operand of a parallel instruction is the coefficient register, the global coefficient registers 1 and 2 will be selected for execution units 3 and 4, respectively. By using the coefficient bits in the instruction field associated with each execution unit, the coefficients can be customized for each data point such that two passes can be merged into one.
Attribute bits can be included in a data structure for data to indicate the properties of each data point.
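One possible layout of such a data structure, combining the previous/current data fields and the attribute bits named above (filler, fixed-value, pass, and coefficient), is sketched below; the field types and encodings are assumptions rather than the patent's actual format.

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    previous: float         # data value from the prior iteration
    current: float          # data value from the current iteration
    variable: float         # fixed variable, e.g. charge density rho
    filler: bool = False    # point lies outside the boundary (NaN filler)
    fixed: bool = False     # fixed-value boundary point
    pass_id: int = 0        # which PASSREG value processes this point
    coeff: int = 0          # index into COEFREG, e.g. selects a permittivity

def commit(p: DataPoint, result: float) -> None:
    """Write back a new result, honoring the attribute bits."""
    if p.fixed or p.filler:
        return              # operation nullified for fixed/filler points
    p.previous, p.current = p.current, result

pt = DataPoint(previous=0.0, current=0.5, variable=1.0, coeff=1)
commit(pt, 0.42)            # pt.current becomes 0.42
```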
a) shows a more detailed block diagram further illustrating the execution units and the global and local execution registers GER and LER. The Global Execution Register (GER) 50 consists of global control registers, such as Mask Register 52 and Pass Register 53, and global data registers, such as Coefficient Register (COEFREG) 51. Those registers are applied to all execution units. In contrast, the Local Execution Registers (LER) 45 are local data and control registers applied to each execution unit only. The Data Point Register (DPR) 46 can be loaded with the old data. The Data Attribute Register 47 can be used to store the attribute bits as shown in
a) shows another embodiment of the datapath. The Data Point Register is not implemented in this embodiment. Instead, the register files 29-1, 29-3, 29-4, and 29-5 in
b) shows a representative example of the pseudo-code of instruction executions in the smart memory to calculate each new data point according to (Eq. 1.1). For example, M[A0,1] specifies the content of the memory address for data point (0,1) in the mesh. Similarly, M[ρ1,3] specifies the content of the memory address for the charge density ρ at data point (1,3). In the actual code, only the opcode and the operands of the leftmost execution unit (EX1) are shown. The memory addresses for the other DMBs for parallel execution units can be generated automatically by using a stride register as an offset. For example, if each data point is 4 bytes wide, A0,1=A0,0+4 if the data is stored in a row-major configuration.
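The automatic address generation can be sketched as follows; the DMB indices and the 4-byte stride follow the example in the text, while the function name and base address are hypothetical.

```python
STRIDE = 4                  # each data point is 4 bytes wide

def parallel_addresses(base, dmb_indices):
    """Address for DMB k is base + (k - 1) * STRIDE, so only the
    operands of the leftmost unit (EX1) appear in the actual code."""
    return [base + (k - 1) * STRIDE for k in dmb_indices]

# LOAD M[A0,1],R1.1 issued across the units over DMB1, DMB3, DMB4, DMB5:
A01 = 0x1000                # hypothetical address of data point (0,1)
addrs = parallel_addresses(A01, [1, 3, 4, 5])
print([hex(a) for a in addrs])  # ['0x1000', '0x1008', '0x100c', '0x1010']
```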
The first instruction “load M[A1,1a],DAR.1” loads the attribute fields of data points (1,1), (1,3), (1,4), and (1,5) into each DAR, where A1,1a is the address of the attribute field for data point (1,1), to set the coefficient and the pass bits accordingly. Instruction 2 loads the old data into each Data Point Register (DPR) of the four execution units, respectively. Instruction 3 sets the global pass register PASSREG to zero (0) to start processing data point (1,4) while the other three data points are kept the same. The Von Neumann boundary condition for point (1,4) is processed in instructions 3 through 19, involving indirect memory addressing to fetch the vector perpendicular to the boundary.
Instructions 20 through 32 process data points (1,1) and (1,3) by setting PASSREG=1 in instruction 20. The first four load instructions, 21-24, load the four neighbors into register files, followed by three add instructions and a divide instruction to take the average of the four neighbors. Instructions 29 and 30 calculate ρ/∈, the charge density divided by permittivity. Instruction 30 specifies the coefficient registers being used as the source operands in a division. Since data points (1,1) and (1,3) have different permittivities of 1 and 3.9, respectively, these two values are selected by coeff=1 and 2, COEFREG[1]=1 and COEFREG[2]=3.9, according to the coefficient field in the DARs. Instruction 31 calculates the final value of each data point. The qualifier “f” indicates the resultant operand is the final result and will be stored back into the DMB through the register files. Therefore, whether the data in R1.8, R3.8, R4.8, or R5.8 is updated depends on the pass bits and the fixed-value bit of each data point. In this example, the registers R4.8 and R5.8 will be unchanged, but the registers R1.8 and R3.8 will be updated into the data memory M[A1,1] and M[A1,3], respectively. Instruction 33 sets pass=2 into PASSREG to process point (1,5) only in the subsequent instructions.
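The pass-gated write-back behavior of the “f” qualifier can be modeled as below; the dictionary layout and names are illustrative assumptions. A final result is committed to data memory only when the point's pass field matches the global PASSREG and its fixed-value bit is clear.

```python
PASSREG = 1                 # global pass register, set by instruction 20

# Pass and fixed-value attributes per data point (assumed layout).
points = {
    (1, 1): {"pass_id": 1, "fixed": False, "value": 0.0},
    (1, 3): {"pass_id": 1, "fixed": False, "value": 0.0},
    (1, 4): {"pass_id": 0, "fixed": False, "value": 0.0},
    (1, 5): {"pass_id": 2, "fixed": False, "value": 0.0},
}

def writeback_final(coord, result):
    """Model of a store carrying the "f" qualifier."""
    p = points[coord]
    if p["pass_id"] == PASSREG and not p["fixed"]:
        p["value"] = result      # updated into the DMB
    # otherwise the data memory is left unchanged

for coord in points:
    writeback_final(coord, 3.14)
# With PASSREG == 1, only (1,1) and (1,3) are updated, matching the
# behavior of R1.8 and R3.8 versus R4.8 and R5.8 above.
```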
Another embodiment is to use hardware to process the fixed-value and Von Neumann boundary conditions rather than using a data structure.
The instruction 22-0, which finds the normal vectors on boundary surfaces from memory addresses and puts the results into registers, has three bit fields: instruction opcode 22-1, memory address 22-2, and pointer register 22-3. The opcode field 22-1 specifies a special “LOAD” instruction opcode to load the normal vector on a boundary. The memory address field 22-2 specifies the memory address of the coordinates of the Neumann boundary. And the pointer register field 22-3 specifies the register index of where the resultant vectors reside after a parallel search is conducted on the CAM 50. The CAM 50 stores the memory addresses to be matched and a plurality of tag bits such as the Valid bit (V) 50-4, Von Neumann bit (VN) 50-2, Old/New bit (O/N) 50-3, and Fixed-value bit (F) 50-1. The Valid bit 50-4 specifies whether that particular entry contains valid data. The Von Neumann bit 50-2 specifies whether that particular entry falls on a Neumann boundary condition. The Old/New bit (O/N) 50-3 toggles to mark an entry as “Old” once that particular entry has been matched. The Fixed-value bit 50-1 specifies whether that entry is a fixed-value type of boundary condition. With the F and VN bits, the addresses of fixed-value and Neumann boundary conditions can be put into the same CAM 50. The CAM Mask Register 50-5 has the same bit fields as one entry of CAM 50 and enables the corresponding bit field in the parallel search when set.
The Old/New bit is designated for resolving multiple matches. In such a case, the O/N bit of the matched entry is marked as “Old” so that this entry will be skipped in later matchings. This process can continue until all entries in the CAM 50 have been sought out for matches. Then an Old/New bit in a global register is toggled to change the meaning of “Old” into “New”. The connotation of “Old/New” in each entry of CAM 50 is relative to the Old/New bit in the global register.
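The multiple-match resolution can be modeled as follows; the entry layout and method names are illustrative assumptions. Each search returns one “new” matching entry and marks it “old”; once no match remains, the global Old/New sense is toggled for the next round.

```python
class CamEntry:
    def __init__(self, addr, vn=False, fixed=False):
        self.addr, self.valid = addr, True     # V bit
        self.vn, self.fixed = vn, fixed        # VN and F bits
        self.old = False                       # O/N bit

class Cam:
    def __init__(self, entries):
        self.entries = entries
        self.global_old = False                # global Old/New bit

    def match_next(self, want_vn):
        """Return the next 'new' matching entry and mark it 'old'."""
        for e in self.entries:
            if e.valid and e.vn == want_vn and e.old == self.global_old:
                e.old = not self.global_old    # skip in later matchings
                return e
        self.global_old = not self.global_old  # all matched: toggle sense
        return None

cam = Cam([CamEntry(0x100, vn=True), CamEntry(0x104, vn=True),
           CamEntry(0x108, fixed=True)])
while (e := cam.match_next(want_vn=True)) is not None:
    print(hex(e.addr))                         # visits 0x100, then 0x104
```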
When all VN continuity equations are processed in a pass, the Old/New bit in a global control register will be toggled so that all VN entries become new in the next round of processing the VN boundary conditions. Either the fixed-value or Von Neumann boundary conditions in
II. Graphical User Representation and Compiler Techniques
A compiler for smart memory computing does more than just generate assembly code from high level languages line by line and page by page as in conventional computing. The compiler is also responsible for mapping the datapoints in the user space for the problems to be solved into the smart memory space, preserving meaningful physical locations so that parallel executions can be carried out with ease. This mapping involves how to describe geometry objects to the smart memory computing. Conventional computing does not require the physical locations of the datapoints for processing. However, the physical locations of the datapoints are vital for algorithm mapping in smart memory computing. One embodiment of an interface between the user space and the smart memory space is a graphical user representation, e.g., provided by a graphical user interface, so that datapoints in the user space can be mapped directly to the smart memory space.
a) shows a graphical user representation for Smart Memory Computing. The example represented in
Boundaries of any geometry objects can be specified, similarly to the 3D objects described above, by labeling surfaces with different colors to associate them with values or boundary types. The 3D objects and boundaries can be output as text files for the smart memory compiler to read.
Equations governing how datapoints interact, after boundary conditions are applied, can be specified in smart memory computing by mathematical formulas using mathematical symbols, alphabet, Greek letters, or any combinations of the above. For example, Poisson's equation for the elliptical waveguide shown in
b) shows a block diagram describing how a smart memory compiler 615 works, according to one embodiment. The smart memory compiler takes the geometry objects 610, boundary conditions 611, and governing equations 612 to do algorithm mapping 620 and generate parallel execution codes 630. If an algorithm such as finite difference is used, the algorithm mapping simply places datapoints in the user space into the smart memory space by fitting small chunks of datapoints into smart memory integrated circuits one by one. However, if an algorithm such as finite element is used, the compiler is also responsible for preprocessing the input files by dividing objects into small pieces, assigning variables, and generating relationships among those variables. In general, preprocessing may be needed in a smart memory compiler before algorithm mapping and parallel execution code generation, based on the parallel instruction set the smart memory computing provides.
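For the finite-difference case, the algorithm-mapping step can be sketched as a placement of user-space datapoints into SMIC-sized chunks that preserves their physical locations; the chunking policy and names below are illustrative assumptions, not the compiler's actual algorithm.

```python
def map_to_smics(rows, cols, rows_per_smic):
    """Assign each (i, j) datapoint to a SMIC, one row band per SMIC,
    so physically adjacent points stay together for parallel execution."""
    mapping = {}
    for i in range(rows):
        for j in range(cols):
            mapping[(i, j)] = i // rows_per_smic   # SMIC index
    return mapping

mapping = map_to_smics(rows=8, cols=8, rows_per_smic=4)
print(mapping[(0, 0)], mapping[(5, 3)])   # -> 0 1  (two SMICs used)
```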
III. New Instructions
The abundant tightly-coupled processors in a smart memory computing system allow many specialized instructions that make parallel processing more efficient. In the Single Instruction Multiple Data (SIMD) architecture, a single instruction is applied to many data that behave exactly the same way. However, smart memory computing is more than the SIMD architecture: it allows the register files to be indexed, the memory to be addressed with offsets among the different functional units, and the parallel instructions to carry different qualifiers.
For example, the LOAD instruction 21 in
Load M[A0,1],R1.1 Load M[A0,3],R3.1 Load M[A0,4],R4.1 Load M[A0,5],R5.1. This parallel instruction specifies that the memory location at A0,1 is loaded into entry 1 of RF1, the memory location at A0,3 into entry 1 of RF3, and so on. If each data point takes 4 bytes, the memory addresses A0,1, (A0,1)+8, (A0,1)+12, and (A0,1)+16 are loaded into entry 1 of RF1, RF3, RF4, and RF5, respectively. Similarly, the ALU instructions can take operands from other register files for execution. For example,
a)-(f) show different types of parallel instructions: ADD; ADD,W; ADD,L; ADD,R; and ADD,C. In
a) shows an “ADD” instruction without any qualifiers. “ADD R0.3, R1.5, R1.6” adds entry 3 of the immediately left RF to entry 5 of each RF and stores the result back into entry 6 of each RF. The rightmost register file RFR is not used.
d) and (e) show two ADD instructions with qualifiers “L” and “R” such that RF0 and RFR, respectively, are used for all execution units.
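A minimal interpreter for the qualifier semantics is sketched below; the number of units, register-file contents, and encoding are assumptions. Each unit k adds an entry of a neighboring register file to an entry of its own, and the qualifier selects which file serves as the “left” source.

```python
K = 4                                  # four execution units EX1..EX4
RF0 = [1, 2, 3]                        # leftmost register file
RFR = [70, 71, 72]                     # rightmost register file
RF = [[10 * k + e for e in range(3)] for k in range(1, K + 1)]  # RF1..RF4

def padd(x, y, z, qual=None):
    """Parallel ADD: RFk.z = left_source.x + RFk.y for every unit k."""
    for k in range(K):
        if qual == "L":                # leftmost RF0 for all units
            left = RF0
        elif qual == "R":              # rightmost RFR for all units
            left = RFR
        elif qual == "W":              # wrap-around: unit 1 reads RF4
            left = RF[(k - 1) % K]
        else:                          # default: the immediately left RF
            left = RF0 if k == 0 else RF[k - 1]
        RF[k][z] = left[x] + RF[k][y]

padd(0, 1, 2)          # ADD without qualifier (cf. "ADD R0.3, R1.5, R1.6")
padd(0, 1, 2, "W")     # ADD,W: register files accessed wrap-around
padd(0, 1, 2, "L")     # ADD,L: RF0 supplies the operand for all units
```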
a) and (b) show the instruction format for LOAD/STORE instructions. These instructions load the data from DMBs into their respective RFs or store the data from RFs into their respective DMBs. In
The inter-SMIC data transfers can be chained together so that the completion of one transfer can fire the next one.
b) shows another block transfer triggering mechanism, based on an event rather than on a pre-determined timing. The blocks 630, 631, and 632 are signals generated after the MOV instruction is executed. The first DMA 600 is triggered immediately, but the second DMA 601 waits until the DMA 600 finishes its job. Similarly, the third DMA 602 waits until the DMA 601 finishes its job. The third DMA generates a signal to the first DMA indicating the end of the inter-SMIC transfer in a round robin fashion; the DMA 602 does not enable any further data transfer but just sends an ending signal to the first SMIC.
The advantages of the DMA-like operation for inter-SMIC data movement are that (1) a single instruction can result in a block of data transfer; and (2) the data movement can overlap with the parallel executions. The target of the transfer is usually set to either DMB0 or DMBR as a buffer so that the normal parallel execution will not be interrupted. Once the inter-SMIC data transfer is completed, an “L/R” bit in a global control register can be set so that the roles of DMB0 and DMBR, and of RF0 and RFR, can be reversed to continue another round of parallel executions without any code changes.
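The event-triggered chain can be modeled with completion events, as sketched below; the thread-based timing and all names are illustrative assumptions. Each transfer starts only when its predecessor signals completion, and the last one merely signals the end of the round.

```python
import threading
import time

def dma(name, start_evt, done_evt):
    start_evt.wait()               # block until the previous DMA finishes
    time.sleep(0.01)               # stand-in for the block data transfer
    print(name, "done")
    done_evt.set()                 # fire the next DMA (or the end signal)

e0, e1, e2, e_end = (threading.Event() for _ in range(4))
chain = [("DMA 600", e0, e1),      # triggered immediately
         ("DMA 601", e1, e2),      # waits for DMA 600
         ("DMA 602", e2, e_end)]   # waits for DMA 601, then signals the end
threads = [threading.Thread(target=dma, args=a) for a in chain]
for t in threads:
    t.start()
e0.set()                           # the first DMA is triggered immediately
e_end.wait()                       # end of the inter-SMIC transfer
for t in threads:
    t.join()
```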
The smart memory computing system of the invention utilizes a memory system having processing capabilities in addition to data storage. Therefore, the SMICs behave like a multiple-functional-unit CPU with integrated memory. Moreover, the smart memory sub-system also has bus master capabilities to interrupt the CPU and to request bus ownership. General computing concepts such as the type and number of execution units, instruction decoder, register files, scratch pad RAM, data memories, pipeline, status and control registers, and exceptions and interrupts can be applied to the smart memory computing system without limiting the scope of this invention.
The invention is preferably implemented in electronic circuitry, but can be implemented by electronic circuitry in combination with software. Such software can be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of the computer readable medium include read-only memory, random-access memory, CD-ROMs, magnetic tape, optical data storage devices, and carrier waves. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
The many features and advantages of the present invention are apparent from the written description and, thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.
This application is a continuation-in-part application of U.S. patent application Ser. No. 10/199,745, filed Jul. 19, 2002, now U.S. Pat. No. 6,970,988 and entitled “ALGORITHM MAPPING, SPECIALIZED INSTRUCTIONS AND ARCHITECTURE FEATURES FOR SMART MEMORY COMPUTING,” which is hereby incorporated herein by reference, which is a continuation-in-part application of U.S. patent application Ser. No. 10/099,440, filed Mar. 14, 2002, and entitled “Method and Apparatus of Using Smart Memories in Computing System,” now U.S. Pat. No. 6,807,614, which is hereby incorporated herein by reference, and which claimed priority benefit of U.S. Provisional Patent Application No. 60/306,636 and U.S. Provisional Patent Application No. 60/341,411, filed Dec. 17, 2001, which are both also hereby incorporated herein by reference. U.S. patent application Ser. No. 10/199,745 also claims the benefit of U.S. Provisional Patent Application No. 60/306,636, filed Jul. 19, 2001 and entitled “Method and Apparatus of Using Smart Memories in Computing System,” which is hereby incorporated herein by reference.
References Cited: U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
4823257 | Tonomura | Apr 1989 | A
4873630 | Rusterholz et al. | Oct 1989 | A
5226171 | Hall et al. | Jul 1993 | A
5301340 | Cook | Apr 1994 | A
5396641 | Iobst et al. | Mar 1995 | A
5678021 | Pawate et al. | Oct 1997 | A
5983004 | Shaw et al. | Nov 1999 | A
6292903 | Coteus et al. | Sep 2001 | B1
6741616 | Sutherland et al. | May 2004 | B1
6807614 | Chung | Oct 2004 | B2
7159185 | Vedula et al. | Jan 2007 | B1
Prior Publication Data

Number | Date | Country
---|---|---
20050246698 A1 | Nov 2005 | US
Provisional Applications

Number | Date | Country
---|---|---
60306636 | Jul 2001 | US
60341411 | Dec 2001 | US
Related Parent Applications

Relation | Number | Date | Country
---|---|---|---
Parent | 10199745 | Jul 2002 | US
Child | 11175559 | | US
Parent | 10099440 | Mar 2002 | US
Child | 10199745 | | US