The following description relates generally to multi-processor systems, and more particularly to a system having two memory access paths: 1) a cache-access path in which block data is fetched from main memory for loading to a cache, and 2) a direct-access path in which individually-addressed data is fetched from main memory for directly loading data into processor registers and/or storing data.
The popularity of computing systems continues to grow, and the demand for improved processing architectures likewise continues to grow. Ever-increasing desires for improved computing performance/efficiency have led to various improved processor architectures. For example, multi-core processors are becoming more prevalent in the computing industry and are being used in various computing devices, such as servers, personal computers (PCs), laptop computers, personal digital assistants (PDAs), wireless telephones, and so on.
In the past, processors such as CPUs (central processing units) featured a single execution unit to process instructions of a program. More recently, computer systems are being developed with multiple processors in an attempt to improve the computing performance of the system. In some instances, multiple independent processors may be implemented in a system. In other instances, a multi-core architecture may be employed, in which multiple processor cores are amassed on a single integrated silicon die. Each of the multiple processors (e.g., processor cores) can simultaneously execute program instructions. This parallel operation of the multiple processors can improve performance of a variety of applications.
A multi-core CPU combines two or more independent cores into a single package comprising a single piece of silicon integrated circuit (IC), called a die. In some instances, a multi-core CPU may comprise two or more dies packaged together. A dual-core device contains two independent microprocessors and a quad-core device contains four microprocessors. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level (e.g., L2 for the Intel® Core 2) or may have separate caches (e.g., current AMD® dual-core processors). The processors also share the same interconnect to the rest of the system. Each “core” may independently implement optimizations such as superscalar execution, pipelining, and multithreading. A system with N cores is typically most effective when it is presented with N or more threads concurrently.
One processor architecture that has been developed utilizes multiple processors (e.g., multiple cores), which are homogeneous in that they are all implemented with the same fixed instruction sets (e.g., Intel's x86 instruction set, AMD's Opteron instruction set, etc.). Further, the homogeneous processors may employ a cache memory coherency protocol, as discussed further below.
In general, an instruction set refers to a list of all instructions, and all their variations, that a processor can execute. Such instructions may include, as examples, arithmetic instructions, such as ADD and SUBTRACT; logic instructions, such as AND, OR, and NOT; data instructions, such as MOVE, INPUT, OUTPUT, LOAD, and STORE; and control flow instructions, such as GOTO, if X then GOTO, CALL, and RETURN. Examples of well-known instruction sets include x86 (also known as IA-32), x86-64 (also known as AMD64 and Intel® 64), AMD's Opteron, VAX (Digital Equipment Corporation), IA-64 (Itanium), and PA-RISC (HP Precision Architecture).
Generally, the instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel® Pentium and the AMD® Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal microarchitecture designs. In all these cases the instruction set (e.g., x86) is fixed by the manufacturer and directly hardware implemented, in a semiconductor technology, by the microarchitecture. Consequently, the instruction set is fixed for the lifetime of this implementation.
Cache memory coherency is an issue that affects the design of computer systems in which two or more processors share a common area of memory. In general, processors often perform work by reading data from persistent storage (e.g., disk) into memory, performing some operation on that data, and then storing the result back to persistent storage. In a uniprocessor system, there is only one processor doing all the work, and therefore only one processor that can read or write the data values. Moreover, a simple uniprocessor can only perform one operation at a time, and thus when a value in storage is changed, all subsequent read operations will see the updated value. However, in multiprocessor systems (e.g., multi-core architectures) there are two or more processors working at the same time, and so the possibility arises that the processors will all attempt to process the same value at the same time. Provided none of the processors updates the value, they can share it indefinitely; but as soon as one updates the value, the others will be working on an out-of-date copy of the data. Accordingly, in such multiprocessor systems a scheme is generally required to notify all processors of changes to shared values, and such a scheme is commonly referred to as a “cache coherence protocol.” Various well-known protocols have been developed for maintaining cache coherency in multiprocessor systems, such as the MESI protocol, MSI protocol, MOSI protocol, and MOESI protocol. Accordingly, such cache coherency generally refers to the integrity of data stored in local caches of the multiple processors.
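By way of illustration only, the stale-copy problem and the notification scheme described above may be sketched in Python as follows. This is a deliberately simplified write-invalidate model under stated assumptions (write-through to memory, a list standing in for the shared interconnect); it is not an implementation of MESI, MSI, MOSI, or MOESI, and the names SharedMemory, CoherentCache, and bus are hypothetical.

```python
# Simplified write-invalidate sketch; all class and variable names are hypothetical.

class SharedMemory:
    def __init__(self):
        self.data = {}  # address -> value


class CoherentCache:
    def __init__(self, memory, peers):
        self.memory = memory
        self.peers = peers   # list shared by every cache (stands in for the interconnect)
        self.lines = {}      # address -> locally cached value
        peers.append(self)

    def read(self, addr):
        # On a miss, fetch the current value from shared memory.
        if addr not in self.lines:
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        # Notify every other cache so that out-of-date copies are dropped.
        for peer in self.peers:
            if peer is not self:
                peer.lines.pop(addr, None)
        self.lines[addr] = value
        self.memory.data[addr] = value  # write-through, assumed for simplicity


memory = SharedMemory()
bus = []
cache_a = CoherentCache(memory, bus)
cache_b = CoherentCache(memory, bus)

memory.data[0x100] = 1
first_a = cache_a.read(0x100)
first_b = cache_b.read(0x100)   # both caches now hold a copy
cache_a.write(0x100, 2)         # drops the copy held by cache_b
after_b = cache_b.read(0x100)   # re-fetched, so the update is visible
```

Without the invalidation loop in write, cache_b would continue to return its out-of-date copy, which is the situation a coherence protocol is intended to prevent.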
As further shown, a cache 103 is also implemented on die 102. Cores 104A and 104B are each communicatively coupled to cache 103. As is well known, a cache generally is memory for storing a collection of data duplicating original values stored elsewhere (e.g., in main memory 101) or computed earlier, where the original data is expensive to fetch (due to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache 103 generally provides a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in cache 103, future use can be made by accessing the cached copy rather than re-fetching the original data from main memory 101, so that the average access time is shorter. In many systems, cache accesses are approximately 50 times faster than similar accesses to main memory 101. Cache 103, therefore, helps expedite access to data that the micro-cores 104A and 104B would otherwise have to fetch from main memory 101.
In many system architectures, each core 104A and 104B will have its own cache also, commonly called the “L1” cache, and cache 103 is commonly referred to as the “L2” cache. Unless expressly stated herein, cache 103 generally refers to any level of cache that may be implemented, and thus may encompass L1, L2, etc. Accordingly, while shown for ease of illustration as a single block that is accessed by both of cores 104A and 104B, cache 103 may include L1 cache that is implemented for each core. Again, a cache coherency protocol may be employed to maintain the integrity of data stored in local caches of the multiple processor cores 104A/104B, as is well known.
In many architectures, virtual addresses are utilized. In general, a virtual address is an address identifying a virtual (non-physical) entity. As is well-known in the art, virtual addresses may be utilized for accessing memory. Virtual memory is a mechanism that permits data that is located on a persistent storage medium (e.g., disk) to be referenced as if the data was located in physical memory. Translation tables, maintained by the operating system, are used to determine the location of the reference data (e.g., disk or main memory). Program instructions being executed by a processor may refer to a virtual memory address, which is translated into a physical address. To minimize the performance penalty of address translation, most modern CPUs include an on-chip Memory Management Unit (MMU), and maintain a table of recently used virtual-to-physical translations, called a Translation Look-aside Buffer (TLB). Addresses with entries in the TLB require no additional memory references (and therefore time) to translate. However, the TLB can only maintain a fixed number of mappings between virtual and physical addresses; when the needed translation is not resident in the TLB, action will have to be taken to load it in.
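The translation mechanism described above may be sketched, purely for illustration, as a small Python model. The 4096-byte page size, the least-recently-used replacement policy, and the names TLB, PAGE_SIZE, and page_table are assumptions made for this example and are not taken from the description.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Fixed-capacity table of recently used virtual-to-physical translations."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # virtual page number -> physical frame number
        self.entries = OrderedDict()   # resident translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.entries:
            # Resident translation: no additional memory reference is needed.
            self.hits += 1
            self.entries.move_to_end(vpage)
        else:
            # Not resident: load the translation from the translation tables.
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used (assumed policy)
            self.entries[vpage] = self.page_table[vpage]
        return self.entries[vpage] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}             # hypothetical mappings
tlb = TLB(capacity=2, page_table=page_table)
paddr = tlb.translate(1 * PAGE_SIZE + 20)   # miss: translation is loaded
tlb.translate(1 * PAGE_SIZE + 24)           # hit: same page
tlb.translate(0)                            # miss
tlb.translate(2 * PAGE_SIZE)                # miss: evicts the entry for page 1
```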
As an example, suppose a program's instruction stream that is being executed by a processor, say processor core 104A of
Traditional implementations of cache 103 have proven to be extremely effective in many areas of computing because access patterns in many computer applications have locality of reference. There are several kinds of locality, including temporal locality (data that are accessed close together in time) and spatial locality (data that are located physically close to each other).
In operation, each of cores 104A and 104B references main memory 101 by providing a physical memory address. The physical memory address (of data or “an operand” that is desired to be retrieved) is first input to cache 103. If the addressed data is not encached (i.e., not present in cache 103), the same physical address is presented to main memory 101 to retrieve the desired data.
In contemporary architectures, a cache block is fetched from main memory 101 and loaded into cache 103. That is, rather than retrieving only the addressed data from main memory 101 for storage to cache 103, a larger block of data may be retrieved for storage to cache 103. A cache block typically comprises a fixed-size amount of data that is independent of the actual size of the requested data. For example, in most implementations a cache block comprises 64 bytes of data that is fetched from main memory 101 and loaded into cache 103 independent of the actual size of the operand referenced by the requesting micro-core 104A/104B. Furthermore, the physical address of the cache block referenced and loaded is a block address. This means that all the cache block data is in sequentially contiguous physical memory. Table 1 below shows an example of a cache block.
In the example of Table 1, in response to a micro-core 104A/104B requesting Operand 0 via its corresponding physical address X,Y,Z (0), a 64-byte block of data may be fetched from main memory 101 and loaded into cache 103, wherein such block of data includes not only Operand 0 but also Operands 1-7. Thus, depending on the fixed size of the cache block employed on a given system, whenever a core 104A/104B references one operand (e.g., a simple load), the memory system will bring 4, 8, or 16 operands into cache 103.
There are both advantages and disadvantages of this traditional approach. One advantage is that if there are temporal (over time) and spatial (data locality) references to operands (e.g., operands 0-7 in the example of Table 1), then cache 103 reduces the memory access time. Typically, cache accesses (and data bandwidth) are 50 times faster than similar accesses to main memory 101. For many applications, this is the memory access pattern.
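As an illustrative sketch of the block fetch described above, the following Python fragment computes which operand addresses accompany a single requested operand. The 64-byte block size follows the text; the 8-byte operand size, the example address, and the function names are assumptions made for this example.

```python
BLOCK_SIZE = 64    # bytes per cache block, as in the example above
OPERAND_SIZE = 8   # assumed operand size in bytes (8 operands per block)


def block_base(addr):
    """Return the block address (start of the block) containing addr."""
    return addr - (addr % BLOCK_SIZE)


def operands_in_block(addr):
    """Return the addresses of every operand fetched along with addr."""
    base = block_base(addr)
    return [base + i * OPERAND_SIZE for i in range(BLOCK_SIZE // OPERAND_SIZE)]


requested = 0x1010                       # hypothetical physical address of one operand
fetched = operands_in_block(requested)   # the whole contiguous block is brought in
```

Here a reference to the single operand at 0x1010 brings in eight contiguous operands starting at block address 0x1000, independent of the size of the operand actually requested.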
However, if the memory access pattern of an application is not sequential and/or does not re-use data, inefficiencies arise which result in decreased performance. Consider the following FORTRAN loop that may be executed for a given application:
      DO I = 1, N, 4
         A(I) = B(I) + C(I)
      END DO
In this loop, every fourth element is used. If a cache block maintains 8 operands, then only 2 of the 8 operands are used. Thus, 6/8 of the data loaded into cache 103 and 6/8 of the memory bandwidth are “wasted” in this example.
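The 2-of-8 figure above can be checked with a short Python calculation. This is a sketch that assumes the first accessed element is aligned to the start of a cache block; the function name block_utilization is hypothetical.

```python
def block_utilization(stride, operands_per_block):
    """Count operands used and unused in one cache block for a strided access.

    Assumes the first accessed element sits at the start of the block.
    """
    used = len(range(0, operands_per_block, stride))
    return used, operands_per_block - used


used, wasted = block_utilization(stride=4, operands_per_block=8)
sequential = block_utilization(stride=1, operands_per_block=8)
```

With a stride of 4 the result is 2 operands used and 6 unused per block, whereas a unit stride uses all 8.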
In some architectures, special-purpose processors that are often referred to as “accelerators” are also implemented to perform certain types of operations. For example, a processor executing a program may offload certain types of operations to an accelerator that is configured to perform those types of operations efficiently. Such hardware acceleration employs hardware to perform some function faster than is possible in software running on the normal (general-purpose) CPU. Hardware accelerators are generally designed for computationally intensive software code. Depending upon granularity, hardware acceleration can vary from a small functional unit to a large functional block like motion estimation in MPEG2. Examples of such hardware acceleration include blitting acceleration functionality in graphics processing units (GPUs) and instructions for complex operations in CPUs. Such accelerator processors generally have a fixed instruction set that differs from the instruction set of the general-purpose processor, and the accelerator processor's local memory does not maintain cache coherency with the general-purpose processor.
A graphics processing unit (GPU) is a well-known example of an accelerator. A GPU is a dedicated graphics rendering device commonly implemented for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than typical CPUs for a range of complex algorithms. A GPU implements a number of graphics primitive operations in a way that makes running them much faster than drawing directly to the screen with the host CPU. The most common operations for early two-dimensional (2D) computer graphics include the BitBLT operation (combines several bitmap patterns using a RasterOp), usually in special hardware called a “blitter”, and operations for drawing rectangles, triangles, circles, and arcs. Modern GPUs also have support for three-dimensional (3D) computer graphics, and typically include digital video-related functions.
Thus, for instance, graphics operations of a program being executed by host processors 104A and 104B may be passed to a GPU. While the homogeneous host processors 104A and 104B maintain cache coherency with each other, as discussed above with
Additionally, various devices are known that are reconfigurable. Examples of such reconfigurable devices include field-programmable gate arrays (FPGAs). A field-programmable gate array (FPGA) is a well-known type of semiconductor device containing programmable logic components called “logic blocks”, and programmable interconnects. Logic blocks can be programmed to perform the function of basic logic gates such as AND and XOR, or more complex combinational functions such as decoders or simple mathematical functions. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memories. A hierarchy of programmable interconnects allows logic blocks to be interconnected as desired by a system designer. Logic blocks and interconnects can be programmed by the customer/designer, after the FPGA is manufactured, to implement any logical function, hence the name “field-programmable.”
The present invention is directed to a system and method which employ two memory access paths: 1) a cache-access path in which block data is fetched from main memory for loading to a cache, and 2) a direct-access path in which individually-addressed data is fetched from main memory for directly loading data into processor registers and/or storing data. The memory access techniques described herein may be employed for both loading and storing data. Thus, while much of the description provided herein is directed toward exemplary applications of fetching and loading data, it should be understood that the techniques may be likewise applied for storing data. The system may comprise one or more processor cores that utilize the cache-access path for accessing data. The system may further comprise at least one heterogeneous functional unit that is operable to utilize the direct-access path for accessing data. In certain embodiments, the one or more processor cores, cache, and the at least one heterogeneous functional unit may be included on a common semiconductor die (e.g., as part of an integrated circuit). As described further herein, embodiments of the present invention enable improved system performance by selectively employing the cache-access path for certain instructions (e.g., selectively having the processor core(s) process certain instructions) while selectively employing the direct-access path for other instructions (e.g., by offloading those other instructions to the heterogeneous functional unit).
Embodiments of the present invention provide a system in which two memory access paths are employed for accessing data by two or more processing nodes. A first memory access path (which may be referred to herein as a “cache-access path” or a “block-oriented access path”) is a path in which a block of data is fetched from main memory to cache. This cache-access path is similar to the traditional memory access described above, whereby if the desired data is present in cache, it is accessed from the cache and if the desired data is not present in the cache it is fetched from main memory and loaded into the cache. Such fetching may load not only the desired data into cache, but may also load some fixed block of data, commonly referred to as a “cache block” as discussed above (e.g., a 64-byte cache block). A second memory access path (which may be referred to herein as a “direct-access path”, “cache-bypass path”, or “address-oriented access”) enables the cache to be bypassed to retrieve data directly from main memory. In such a direct access, data of an individual physical address that is requested may be retrieved, rather than retrieving a block of data that encompasses more than what is desired.
According to certain embodiments of the present invention, the main memory is implemented as non-sequential access main memory that supports random address accesses as opposed to block accesses. That is, upon requesting a given physical address, the main memory may return a corresponding operand (data) that is stored at the given physical address, rather than returning a fixed block of data residing at physical addresses. In other words, rather than returning a fixed block of data (e.g., a 64-byte block of data as described in Table 1 above) independent of the requested physical address, the main memory is implemented such that what it returns is dependent on the requested physical address (i.e., it is capable of returning only the individual data residing at the requested physical address).
When being accessed directly (via the “direct-access path”), the main memory returns the data residing at a given requested physical address, rather than returning a fixed block of data that is independent (in size) of the requested physical address. Thus, rather than a block-oriented access, an address-oriented access may be performed in which only the data for the requested physical address is retrieved. Further, when being accessed via the cache-access path, the main memory is capable of returning a cache block of data. For instance, the non-sequential access main memory can be used to emulate a block reference when desired for loading to a cache, but also supports individual random address accesses without requiring a block load (e.g., when being accessed via the direct-access path). Thus, the same non-sequential access main memory is utilized (with the same physical memory addresses) for both the direct-access and cache-access paths. According to one embodiment, the non-sequential access main memory is implemented by scatter/gather DIMMs (dual in-line memory modules).
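For illustration only, a memory that serves both kinds of access from the same physical addresses might be modeled in Python as follows. The class name NonSequentialMemory, the 8-byte operand size, and the dictionary-backed storage are assumptions made for this sketch; it does not represent the scatter/gather DIMM hardware.

```python
class NonSequentialMemory:
    """Toy model: individually addressable memory that can also emulate block reads."""

    BLOCK_SIZE = 64    # bytes per cache block, as in the text
    OPERAND_SIZE = 8   # assumed operand size in bytes

    def __init__(self):
        self.cells = {}  # physical address -> operand

    def store(self, addr, value):
        self.cells[addr] = value

    def load(self, addr):
        # Direct-access path: return only the operand at the requested address.
        return self.cells.get(addr, 0)

    def load_block(self, addr):
        # Cache-access path: emulate a block reference with individual accesses.
        base = addr - addr % self.BLOCK_SIZE
        return [self.load(base + off)
                for off in range(0, self.BLOCK_SIZE, self.OPERAND_SIZE)]


memory = NonSequentialMemory()
for i in range(8):
    memory.store(0x1000 + i * 8, i * 10)

single = memory.load(0x1018)        # one operand, no block load
block = memory.load_block(0x1018)   # the full block containing that operand
```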
According to certain embodiments, the above-mentioned memory architecture is implemented in a system that comprises at least one processor and at least one heterogeneous functional unit. As an example, a semiconductor die (e.g., die 102 of
The processor(s) may utilize the cache-access path for accessing memory, while the heterogeneous functional unit is operable to utilize the direct-access path. Thus, certain instructions being processed for a given application may be off-loaded from the one or more processors to the heterogeneous functional unit such that the heterogeneous functional unit may take advantage of the cache-bypass path to access memory for processing those off-loaded instructions. For instance, again consider the following FORTRAN loop that may be executed for a given application:
      DO I = 1, N, 4
         A(I) = B(I) + C(I)
      END DO
In this loop, every fourth element (or physical memory address) is used, loaded or stored. As discussed above, if a cache-access path is utilized in which a cache block of 8 operands is retrieved for each access of main memory, then only 2 of the 8 operands are used, and 6/8 of the data loaded into the cache and 6/8 of the memory bandwidth are “wasted” in this example. In certain embodiments of the present invention, such DO loop operation may be off-loaded to the heterogeneous functional unit, which may retrieve the individual data elements desired to be accessed directly from the non-sequential access main memory.
As mentioned above, the cache block memory access approach is beneficial in many instances, such as when the data accesses have temporal and/or spatial locality, but such cache block memory access is inefficient in certain instances, such as in the exemplary DO loop operation above. Accordingly, by selectively employing the cache-access path for certain instructions and employing the direct-access path for other instructions, the overall system performance can be improved. That is, by off-loading certain instructions to a heterogeneous functional unit that is operable to bypass cache and access individual data (e.g., random, non-sequential addresses) from main memory, rather than requiring fetching of fixed block size of data from main memory, while permitting the cache block memory access to be utilized by the one or more processors (and thus gain the benefits of the cache for those instructions that have temporal and/or spatial locality), the system performance can be improved.
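The difference between the two paths for the exemplary DO loop can be estimated with a rough Python calculation. The sketch below assumes 8-byte operands, 64-byte cache blocks, block-aligned arrays, and that loads of B and C and stores of A are counted alike; the function names and the choice of N = 1024 are hypothetical.

```python
BLOCK_SIZE = 64    # bytes per cache block
OPERAND_SIZE = 8   # assumed operand size in bytes


def bytes_via_cache_path(n, stride):
    """Bytes moved for one array when every touched block is transferred whole."""
    blocks = {(i * OPERAND_SIZE) // BLOCK_SIZE for i in range(0, n, stride)}
    return len(blocks) * BLOCK_SIZE


def bytes_via_direct_path(n, stride):
    """Bytes moved for one array when only the addressed operands are transferred."""
    return len(range(0, n, stride)) * OPERAND_SIZE


n = 1024                                        # hypothetical array length
cache_bytes = 3 * bytes_via_cache_path(n, 4)    # arrays A, B, and C
direct_bytes = 3 * bytes_via_direct_path(n, 4)
```

Under these assumptions the cache-access path moves 24576 bytes and the direct-access path moves 6144 bytes, a factor of four, which matches the 2-of-8 utilization discussed above.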
In certain embodiments, the heterogeneous functional unit implemented comprises a different instruction set than the native instruction set of the one or more processors. Further, in certain embodiments, the instruction set of the heterogeneous functional unit may be dynamically reconfigurable. As an example, in one implementation three (3) mutually-exclusive instruction sets may be pre-defined, any of which may be dynamically loaded to the heterogeneous functional unit. As an illustrative example, a first pre-defined instruction set might be a vector instruction set designed particularly for processing 64-bit floating point operations as are commonly encountered in computer-aided simulations, a second pre-defined instruction set might be designed particularly for processing 32-bit floating point operations as are commonly encountered in signal and image processing applications, and a third pre-defined instruction set might be designed particularly for processing cryptography-related operations. While three illustrative pre-defined instruction sets are described above, it should be recognized that embodiments of the present invention are not limited to the exemplary instruction sets mentioned above. Rather, any number of instruction sets of any type may be pre-defined in a similar manner and may be employed on a given system in addition to or instead of one or more of the above-mentioned pre-defined instruction sets.
Further, in certain embodiments the heterogeneous functional unit contains some operational instructions that are part of the native instruction set of the one or more processors (e.g., micro-cores). For instance, in certain embodiments, the x86 (or other) instruction set may be modified to include certain instructions that are common to both the processor(s) and the heterogeneous functional unit. For instance, certain operational instructions may be included in the native instruction set of the processor(s) for off-loading instructions to the heterogeneous functional unit.
For example, in one embodiment, the instructions of an application being executed are decoded by the one or more processors (e.g., micro-core(s)). Suppose that the processor fetches a native instruction (e.g., x86 instruction) that is called, as an example, “Heterogeneous Instruction 1”. The decode logic of the processor determines that this is an instruction to be off-loaded to the heterogeneous functional unit, and thus in response to decoding the Heterogeneous Instruction 1, the processor initiates a control sequence to the heterogeneous functional unit to communicate the instruction to the heterogeneous functional unit for processing. So, the processor (e.g., micro-core) may decode the instruction and initiate the heterogeneous functional unit via a control line. The heterogeneous functional unit then sends instructions to reference memory via the direct-access path.
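The decode-and-off-load sequence described above may be sketched, as a software analogy only, in Python. The classes HostCore and HeterogeneousUnit, the opcode string, and the dictionary standing in for main memory are all hypothetical; the actual control sequence is a hardware mechanism and is not specified here.

```python
# Software analogy only; opcode names and class names are hypothetical.
OFFLOAD_OPCODES = {"HETEROGENEOUS_INSTRUCTION_1"}


class HeterogeneousUnit:
    def __init__(self, main_memory):
        self.main_memory = main_memory   # physical address -> operand
        self.received = []               # instructions communicated by the host

    def execute(self, opcode, address):
        self.received.append(opcode)
        # Direct-access path: reference one individually-addressed operand.
        return self.main_memory[address]


class HostCore:
    def __init__(self, unit):
        self.unit = unit

    def decode(self, opcode, address):
        # The decode logic determines whether the instruction is to be off-loaded.
        if opcode in OFFLOAD_OPCODES:
            return "offloaded", self.unit.execute(opcode, address)
        return "native", None


main_memory = {0x2000: 42}
core = HostCore(HeterogeneousUnit(main_memory))
native_result = core.decode("ADD", 0x2000)
offload_result = core.decode("HETEROGENEOUS_INSTRUCTION_1", 0x2000)
```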
In certain embodiments, the heterogeneous functional unit comprises a co-processor, such as the exemplary co-processor disclosed in co-pending and commonly assigned U.S. patent application Ser. No. 11/841,406 filed Aug. 20, 2007 titled “MULTI-PROCESSOR SYSTEM HAVING AT LEAST ONE PROCESSOR THAT COMPRISES A DYNAMICALLY RECONFIGURABLE INSTRUCTION SET”, and U.S. patent application Ser. No. 11/854,432 filed Sep. 12, 2007 titled “DISPATCH MECHANISM FOR DISPATCHING INSTRUCTIONS FROM A HOST PROCESSOR TO A CO-PROCESSOR,” the disclosures of which have been incorporated herein by reference.
According to certain embodiments, an exemplary multi-processor system in which such dispatch mechanism may be employed is described. While an exemplary multi-processor system that comprises heterogeneous processors (i.e., having different instruction sets) is described herein, it should be recognized that embodiments of the dispatch mechanism described herein are not limited to the exemplary multi-processor system described. As one example, according to certain embodiments, a multi-processor system is provided that comprises at least one processor having a dynamically reconfigurable instruction set. According to certain embodiments, at least one host processor is implemented in the system, which may comprise a fixed instruction set, such as the well-known x86 instruction set. Additionally, at least one co-processor is implemented, which comprises dynamically reconfigurable logic that enables the co-processor's instruction set to be dynamically reconfigured. In this manner, the at least one host processor and the at least one dynamically reconfigurable co-processor are heterogeneous processors because the dynamically reconfigurable co-processor may be configured to have a different instruction set than that of the at least one host processor. According to certain embodiments, the co-processor may be dynamically reconfigured with an instruction set for use in optimizing performance of a given executable. For instance, in certain embodiments, one of a plurality of predefined instruction set images may be loaded onto the co-processor for use by the co-processor in processing a portion of a given executable's instruction stream.
In certain embodiments, an executable (e.g., an a.out file or a.exe file, etc.) may include (e.g., in its header) an identification of an instruction set with which the co-processor is to be configured for use in processing a portion of the executable's instruction stream. Accordingly, when the executable is initiated, the system's operating system (OS) may determine whether the co-processor possesses the instruction set identified for the executable. If it is determined that the co-processor does not possess the identified instruction set, the OS causes the co-processor to be reconfigured to possess such identified instruction set. Then, a portion of the instructions of the executable may be off-loaded for processing by the co-processor according to its instruction set, while a portion of the executable's instructions may be processed by the at least one host processor. Accordingly, in certain embodiments, a single executable may have instructions that are processed by different, heterogeneous processors that possess different instruction sets. As described further herein, according to certain embodiments, the co-processor's instructions are decoded as if they were defined with the host processor's instruction set (e.g., x86's ISA). In essence, to a compiler, it appears that the host processor's instruction set (e.g., the x86 ISA) has been extended.
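The check-and-reconfigure step performed when an executable is initiated may be sketched in Python as follows. This is a minimal model under stated assumptions: the header is represented as a dictionary, and the names CoProcessor, launch, and the instruction set labels are hypothetical and do not correspond to any actual OS interface.

```python
class CoProcessor:
    def __init__(self):
        self.instruction_set = None   # currently loaded instruction set, if any
        self.reconfigurations = 0

    def load_instruction_set(self, name):
        # Stands in for loading a predefined instruction set image.
        self.instruction_set = name
        self.reconfigurations += 1


def launch(executable_header, coprocessor):
    """Reconfigure the co-processor only if it lacks the identified instruction set."""
    required = executable_header.get("instruction_set")
    if required is not None and coprocessor.instruction_set != required:
        coprocessor.load_instruction_set(required)
    return coprocessor.instruction_set


coprocessor = CoProcessor()
launch({"instruction_set": "vector_fp64"}, coprocessor)   # reconfigures
launch({"instruction_set": "vector_fp64"}, coprocessor)   # already present, no change
launch({"instruction_set": "crypto"}, coprocessor)        # reconfigures again
launch({}, coprocessor)                                   # no instruction set identified
```

In this sketch the second launch leaves the co-processor untouched, reflecting that reconfiguration occurs only when the identified instruction set is not already possessed.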
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages, will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.
For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
System 200 employs two memory access paths: 1) a cache-access path in which block data is stored/loaded to/from main memory 201 to/from cache 203, and 2) a direct-access path in which individually-addressed data is stored/loaded to/from main memory 201 (e.g., along path 206 in system 200). For instance, system 200 employs a cache-access path in which block data may be stored to main memory 201 and in which block data may be loaded from main memory 201 to cache 203. Additionally, system 200 employs a direct-access path in which individually-addressed data, rather than a fixed-size block of data, may be stored to main memory 201 and in which individually-addressed data may be loaded from main memory 201 (e.g., along path 206 in system 200) to a processor register (e.g., of heterogeneous functional unit 204).
System 200 comprises two processor cores, 202A and 202B, that utilize the cache-access path for accessing data from main memory 201. System 200 further comprises at least one heterogeneous functional unit 204 that is operable to utilize the direct-access path for accessing data from main memory 201. As described further herein, embodiments of the present invention enable improved system performance by selectively employing the cache-access path for certain instructions (e.g., selectively having the processor core(s) 202A/202B process certain instructions) while selectively employing the direct-access path for other instructions (e.g., by offloading those other instructions to the heterogeneous functional unit 204).
Embodiments of the present invention provide a system in which two memory access paths are employed for accessing data by two or more processing nodes. A first memory access path (which may be referred to herein as a “cache-access path” or a “block-oriented access path”) is a path in which a block of data is fetched from main memory 201 to cache 203. This cache-access path is similar to the traditional memory access described above with
According to certain embodiments of the present invention, the main memory is implemented as non-sequential access main memory that supports random address accesses as opposed to block accesses. That is, upon a request for a given physical address, the main memory may return the corresponding operand (data) stored at that physical address, rather than returning a fixed block of data residing at a range of physical addresses. In other words, rather than returning a fixed block of data (e.g., a 64-byte block of data) whose size is independent of the requested physical address, the main memory is implemented such that what it returns depends on the requested physical address (i.e., it is capable of returning only the individual data residing at the requested physical address).
According to certain embodiments, processor cores 202A and 202B are operable to access data in a manner similar to that of traditional processor architectures (e.g., that described above with
When being accessed directly (via the “direct-access path” 206), main memory 201 returns the data residing at a given requested physical address, rather than returning a fixed-size block of data that is independent (in size) of the requested physical address. Thus, rather than a block-oriented access, an address-oriented access may be performed in which only the data for the requested physical address is retrieved. Further, when being accessed via the cache-access path, main memory 201 is capable of returning a cache block of data. For instance, the non-sequential access main memory 201 can be used to emulate a block reference when desired for loading a cache block of data to cache 203, but also supports individual random address accesses without requiring a block load (e.g., when being accessed via the direct-access path 206). Thus, the same non-sequential access main memory 201 is utilized (with the same physical memory addresses) for both the cache-access path (e.g., utilized for data accesses by processor cores 202A and 202B in this example) and the direct-access path (e.g., utilized for data access by heterogeneous functional unit 204). According to one embodiment, non-sequential access main memory 201 is implemented by scatter/gather DIMMs (dual in-line memory modules) 21.
Thus, main memory subsystem 201 supports non-sequential memory references. According to one embodiment, main memory subsystem 201 has the following characteristics:
1) Each memory location is individually addressed. There is no built-in notion of a cache block.
2) The entire physical memory is highly interleaved. Interleaving means that each operand resides in its individually controlled memory location.
3) Thus, full memory bandwidth is achieved for a non-sequentially referenced address pattern. For instance, in the above example of the DO loop that accesses every fourth memory address, the full memory bandwidth is achieved for the address reference pattern: Address1, Address5, Address9, and Address13.
4) If the memory reference is derived from a micro-core, then the memory reference pattern is sequential, e.g., physical address reference pattern: Address1, Address2, Address3, . . . Address8 (assuming a cache block of 8 operands or 8 words).
5) Thus, the memory system can support full bandwidth random physical addresses and can also support full bandwidth sequential addresses.
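The characteristics listed above can be illustrated with a minimal Python sketch. The class name `WordAddressedMemory`, its method names, the 0-based addressing, and the 8-word block size are illustrative assumptions rather than part of the described system. The sketch only shows how a single word-addressed store could serve both individually addressed requests and emulated block requests.

```python
class WordAddressedMemory:
    """Toy model of a main memory in which every word is individually addressed."""

    def __init__(self, num_words):
        # Word i initially holds the value i so that results are easy to check.
        self.words = list(range(num_words))

    def load_word(self, address):
        # Direct-access style request: return only the operand at `address`.
        return self.words[address]

    def store_word(self, address, value):
        # Direct-access style store of a single operand.
        self.words[address] = value

    def load_block(self, address, block_words=8):
        # Cache-access style request: emulate a block reference by issuing
        # sequential word requests for the aligned block containing `address`.
        base = (address // block_words) * block_words
        return [self.load_word(a) for a in range(base, base + block_words)]


memory = WordAddressedMemory(64)
strided = [memory.load_word(a) for a in (1, 5, 9, 13)]  # every fourth address
block = memory.load_block(20)                           # block containing word 20
```

In this sketch the strided request touches exactly four words, while the block request touches eight sequential words. Both are served by the same underlying storage and the same addresses.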
Given a memory system 201 as described above, a mechanism is further provided in certain embodiments to determine whether a memory reference is directed to the cache 203, or directly to main memory 201. In a preferred embodiment of the present invention, a heterogeneous functional unit 204 provides such a mechanism.
In certain embodiments, the determination in block 33 may be made based, at least in part, on the instruction that is fetched. For instance, in certain embodiments, the heterogeneous functional unit 204 contains some operational instructions (in its instruction set) that are part of the native instruction set of the processor cores 202A/202B. For instance, in certain embodiments, the x86 (or other) instruction set may be modified to include certain instructions that are common to both the processor core(s) and the heterogeneous functional unit. For instance, certain operational instructions may be included in the native instruction set of the processor core(s) for off-loading instructions to the heterogeneous functional unit.
For example, in one embodiment, the instructions of an application being executed are decoded by the processor core(s) 202A/202B, wherein the processor core may fetch (in operational block 31) a native instruction (e.g., x86 instruction) that is called, as an example, “Heterogeneous Instruction 1”. The decode logic of the processor core decodes the instruction in block 32 and determines in block 33 that this is an instruction to be off-loaded to the heterogeneous functional unit 204. Thus, in response to decoding the Heterogeneous Instruction 1, the processor core initiates a control sequence (via control line 209) to the heterogeneous functional unit 204 to communicate (in operational block 35) the instruction to the heterogeneous functional unit 204 for processing.
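As a rough illustration of the decode-and-dispatch decision described above, the following Python sketch routes each decoded instruction to one of the two paths. The opcode strings, the `OFFLOADED_OPCODES` set, and the `dispatch` function are hypothetical names chosen for this example. The source names only “Heterogeneous Instruction 1” and does not specify an encoding.

```python
# Hypothetical opcode names; only "Heterogeneous Instruction 1" appears in the text.
OFFLOADED_OPCODES = {"HETEROGENEOUS_INSTRUCTION_1"}


def dispatch(opcode):
    """Return the unit that handles a decoded instruction (the block 33 decision)."""
    if opcode in OFFLOADED_OPCODES:
        # Block 35: communicate the instruction to the heterogeneous functional unit.
        return "heterogeneous_functional_unit"
    # Block 34: the processor core handles it using the cache-access path.
    return "processor_core"


program = ["ADD", "HETEROGENEOUS_INSTRUCTION_1", "MOV"]
routing = [dispatch(op) for op in program]
```

A set lookup stands in for the decode logic here. A real decoder would presumably recognize the off-loaded instructions from their encoding.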
In one embodiment, the cache-path access 34 includes the processor core 202A/202B querying, in block 301, the cache 203 for the physical address to determine if the referenced data (e.g., operand) is encached. In block 302, the processor core 202A/202B determines whether the referenced data is encached in cache 203. If it is encached, then operation advances to block 304 where the processor core 202A/202B retrieves the referenced data from cache 203. If determined in block 302 that the referenced data is not encached, operation advances to block 303 where a cache block fetch from main memory 201 is performed to load a fixed-size block of data, including the referenced data, into cache 203, and then operation advances to block 304 where the processor core retrieves the fetched data from cache 203.
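The cache-path flow of blocks 301 through 304 can be sketched in Python as follows. The dictionary-based cache, the function name `cache_path_load`, the 0-based addressing, and the 8-word block size are assumptions made for illustration. The sketch is not a model of any particular cache organization.

```python
BLOCK_WORDS = 8  # assumed block size, in words


def cache_path_load(address, cache, main_memory):
    """Load one word via the cache-access path; returns (value, hit)."""
    base = (address // BLOCK_WORDS) * BLOCK_WORDS
    hit = base in cache  # blocks 301 and 302: query the cache for the address
    if not hit:
        # Block 303: fetch the whole fixed-size block containing the address.
        cache[base] = main_memory[base:base + BLOCK_WORDS]
    # Block 304: retrieve the referenced word from the cache.
    return cache[base][address - base], hit


main_memory = list(range(100, 164))  # 64 words; word i holds 100 + i
cache = {}
first = cache_path_load(20, cache, main_memory)   # miss, then block fill
second = cache_path_load(21, cache, main_memory)  # hit in the same block
```

The second access hits because the first access loaded the entire block, which is the behavior that benefits sequential reference patterns.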
In one embodiment, the direct-access path 36 (of
In block 306, heterogeneous functional unit 204 determines whether the referenced data has been previously encached in cache 203. If it has not, operation advances to block 310 where the heterogeneous functional unit 204 retrieves the referenced data of the individually-referenced physical address (e.g., physical address 210 and 207 of
If it is determined in block 306 that the referenced data has been previously encached, then in certain embodiments different actions may be performed depending on the type of caching employed in the system. For instance, in block 307, a determination may be made as to whether the cache employs a write-back caching technique or a write-through caching technique, each of which is a well-known caching technique in the art and is thus not described further herein. If a write-back caching technique is employed, then the heterogeneous functional unit 204 writes the cache block of cache 203 that contains the referenced data back to main memory 201, in operational block 308. If a write-through caching technique is employed, then the heterogeneous functional unit 204 invalidates the referenced data in cache 203, in operational block 309. In either case, operation then advances to block 310 to retrieve the referenced data of the individually-referenced physical address (e.g., physical address 210 and 207 of
In certain embodiments, if a hit is achieved from the cache in the direct-access path 36 (e.g., as determined in block 306), then the request may be completed from the cache 203, rather than requiring the entire data block to be written back to main memory 201 (as in block 308) and then referencing the single operand from main memory 201 (as in block 310). That is, in certain embodiments, if a hit is achieved for the cache 203, then the memory access request (e.g., store or load) may be satisfied by cache 203 for the heterogeneous functional unit 204, and if a miss occurs for cache 203, then the referenced data of the individually-referenced physical address (e.g., physical address 210 and 207 of
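The direct-access flow of blocks 306 through 310, in the variant that always completes the request from main memory, might be sketched in Python as shown below. The dictionary-based cache, the `policy` strings, the function name `direct_path_load`, and the choice to invalidate a whole block rather than a single word are assumptions made for illustration.

```python
BLOCK_WORDS = 8  # assumed block size, in words


def direct_path_load(address, cache, main_memory, policy):
    """Load one word via the direct-access path, following blocks 306-310."""
    base = (address // BLOCK_WORDS) * BLOCK_WORDS
    if base in cache:  # block 306: has the data been previously encached?
        if policy == "write-back":  # block 307: which caching technique?
            # Block 308: write the cached block back so main memory is current.
            main_memory[base:base + BLOCK_WORDS] = cache[base]
        elif policy == "write-through":
            # Block 309: main memory is already current; invalidate the copy.
            del cache[base]
        else:
            raise ValueError(f"unknown cache policy: {policy}")
    # Block 310: retrieve only the individually addressed word from main memory.
    return main_memory[address]


main_memory = [0] * 64
dirty_block = [0] * BLOCK_WORDS
dirty_block[4] = 99  # word 20 was modified in the cache only
cache = {16: dirty_block}
value = direct_path_load(20, cache, main_memory, "write-back")
```

With the write-back policy, the modified word reaches main memory before the single-word load, so the direct-access path observes the most recent value.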
For all traditional microprocessors of the prior art, main memory (e.g., 101 of
Typical of these types of applications are those that reference memory using a vector of indices. This is called “scatter/gather”. For example, in the following FORTRAN code:
do i=1,n
a(i)=b(i)+c(i)
enddo
all the elements of a, b, and c are sequentially referenced.
In the following FORTRAN code:
do i=1,n
a(j(i))=b(j(i))+c(j(i))
enddo
a, b, and c are referenced through an index vector. Thus, the physical main memory system is referenced by non-sequential memory addresses.
According to certain embodiments, main memory 201 of system 200 comprises a memory DIMM that is formed utilizing standard memory DRAMs and that provides full-bandwidth memory accesses for non-sequential memory addresses. Thus, if the memory reference pattern is 1, 20, 33, 55, then only memory words 1, 20, 33, and 55 are fetched and stored. In fact, they are fetched and stored at the maximum rate permitted by the DRAMs.
In the above example, with the same memory reference pattern, a block-oriented memory system, with a block size of 8 words, would fetch 4 cache blocks to fetch 4 words:
{1 . . . 8} for word 1;
{17 . . . 24} for word 20;
{33 . . . 40} for word 33; and
{49 . . . 56} for word 55.
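The arithmetic behind this comparison can be checked with a short Python sketch. The function name `block_range` and the variable names are illustrative. The sketch assumes 1-based word addresses and aligned 8-word blocks, matching the example pattern.

```python
def block_range(word, block_words=8):
    """Return the (first, last) 1-based word addresses of the block holding `word`."""
    first = ((word - 1) // block_words) * block_words + 1
    return first, first + block_words - 1


pattern = [1, 20, 33, 55]
blocks = [block_range(w) for w in pattern]

# Words moved when each reference pulls in its whole block.
words_moved_block_oriented = sum(last - first + 1 for first, last in blocks)
# Words moved when only the individually addressed words are fetched.
words_moved_direct = len(pattern)
```

Under these assumptions the block-oriented system moves 32 words to deliver 4 useful words, whereas the address-oriented system moves only the 4 requested words.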
In the above-described embodiment of system 200 of
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
The present application is a continuation of co-pending, commonly assigned, patent application Ser. No. 11/969,792 entitled “MICROPROCESSOR ARCHITECTURE HAVING ALTERNATIVE MEMORY ACCESS PATHS,” filed Jan. 4, 2008, the disclosure of which is hereby incorporated herein by reference. The present application relates to the following co-pending and commonly-assigned U.S. Patent Applications, U.S. patent application Ser. No. 11/841,406 filed Aug. 20, 2007 titled “MULTI-PROCESSOR SYSTEM HAVING AT LEAST ONE PROCESSOR THAT COMPRISES A DYNAMICALLY RECONFIGURABLE INSTRUCTION SET”, U.S. patent application Ser. No. 11/854,432 filed Sep. 12, 2007 titled “DISPATCH MECHANISM FOR DISPATCHING INSTRUCTIONS FROM A HOST PROCESSOR TO A CO-PROCESSOR”, and U.S. patent application Ser. No. 11/847,169 filed Aug. 29, 2007 titled “COMPILER FOR GENERATING AN EXECUTABLE COMPRISING INSTRUCTIONS FOR A PLURALITY OF DIFFERENT INSTRUCTION SETS”, the disclosures of which are hereby incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20170249253 A1 | Aug 2017 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 11969792 | Jan 2008 | US |
Child | 15596649 | US |