COMPUTER SYSTEMS WITH LIGHTWEIGHT MULTI-THREADED ARCHITECTURES

Abstract
Embodiments of the present invention provide a class of computer architectures generally referred to as lightweight multi-threaded architectures (LIMA). Other embodiments may be described and claimed.
Description

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings.

FIG. 1 is a schematic diagram of a lightweight processing chip (LPC), in accordance with various embodiments of the present invention;

FIG. 2 is a schematic diagram of a locale made up of multiple LPCs, in accordance with various embodiments of the present invention;

FIG. 3 schematically illustrates a module with a single LPC, in accordance with various embodiments of the present invention;

FIG. 4 schematically illustrates a module with multiple LPCs, in accordance with various embodiments of the present invention;

FIG. 5 schematically illustrates lightweight processing core and memory macro on-chip relationships, in accordance with various embodiments of the present invention;

FIG. 6 schematically illustrates an LWP subsystem, in accordance with various embodiments of the present invention;

FIG. 7 schematically illustrates thread management, in accordance with various embodiments of the present invention;

FIG. 8 schematically illustrates a memory word, in accordance with various embodiments of the present invention;

FIG. 9 schematically illustrates a data memory address to an aligned xdword, in accordance with various embodiments of the present invention;

FIG. 10 schematically illustrates an instruction address, in accordance with various embodiments of the present invention;

FIG. 11 schematically illustrates extended memory state encodings, in accordance with various embodiments of the present invention;

FIG. 12 schematically illustrates memory operations, in accordance with various embodiments of the present invention;

FIG. 13 schematically illustrates changes of memory state as a function of operation performed, in accordance with various embodiments of the present invention;

FIG. 14 schematically illustrates value returned by memory operations as a reply to an original requestor, in accordance with various embodiments of the present invention;

FIG. 15 schematically illustrates additional operations performed at target memory location, in accordance with various embodiments of the present invention;

FIG. 16 schematically illustrates state changes for register during instruction execution, in accordance with various embodiments of the present invention; and

FIG. 17 schematically illustrates a thread status word, in accordance with various embodiments of the present invention.

Claims
  • 1. A computing system comprising:
    one or more nodes, each node comprising at least one lightweight processing chip (LPC) that includes a lightweight processor (LWP) core and at least one memory module, each node being adapted to concurrently execute a number of independent program threads on behalf of one or more application programs, and each thread being adapted to generate one or more requests to access memory anywhere in the computing system; and
    an interconnect network communicatively coupling multiple nodes such that LPCs within a node may issue a memory or thread creation request that may be routed to a node that includes designated memory locations and return one of a copy of the data or a completion status back to a requesting LPC.
  • 2. The computing system of claim 1, wherein each LPC is adapted to receive memory requests and to generate memory requests.
  • 3. The computing system of claim 2, further comprising memory within the computing system that is external to the nodes.
  • 4. The computing system of claim 3, further comprising an internal node routing system adapted to facilitate memory requests between LPCs within a node and to facilitate communication with the interconnect network for memory requests elsewhere within the computing system.
  • 5. The computing system of claim 1, further comprising at least one heavyweight processor (HWP) communicatively coupled to the interconnect network and adapted to generate streams of memory reference requests.
  • 6. The computing system of claim 5, wherein the at least one HWP does not include program visible memory.
  • 7. The computing system of claim 6, further comprising at least one cache and/or machine register.
  • 8. The computing system of claim 1, wherein the computing system is a massively parallel computing system.
  • 9. The computing system of claim 1, wherein at least some of the LPCs include multiple memory modules.
  • 10. The computing system of claim 9, wherein at least some of the LPCs include multiple LWPs.
  • 11. The computing system of claim 10, wherein each LWP is adapted to execute programs and generate memory requests.
  • 12. The computing system of claim 11, wherein each LPC includes an interconnection network that allows memory requests from each LWP to reach the memory modules, caches within the LWPs, and ports to the node interconnect network.
  • 13. The computing system of claim 1, wherein each memory module comprises memory locations and each memory location has associated with it a value field and an extension field.
  • 14. The computing system of claim 13, wherein the extension field has at least two possible settings:
    full, which indicates that the value field contains data to be interpreted as a series of information bits by some program that accesses it; and
    extended semantics, which indicates that the value field has information that is to be interpreted by the memory interface to control how any memory request that accesses the location is to be performed.
  • 15. The computing system of claim 14, wherein states that a memory location may be in when the extension field is set include at least one from a group comprising:
    an indication that the memory has not been initialized;
    an indication that it contains an error code;
    an indication that the location is locked from some type of access;
    an indication that the memory location is logically empty;
    an indication that the memory location is not only logically empty, but that it is a register in some thread frame, and a next instruction for a program associated with that thread requires a value from the register before it may continue;
    an indication that the location is actually a register in some thread frame, that some instruction for a program for that thread has designated this register to receive the result or status from some prior memory request, and that the next instruction for that program requires completion of that memory operation before it may continue;
    the address of some other location to which any request to this location should be forwarded, including options on what to leave behind in this location after the forwarding has occurred; and
    information that may be used to start a new thread within an LPC controlling the memory location whenever any sort of memory request attempts to access the location.
    (A hypothetical encoding of such a memory word and these states is given in the first sketch following the claims.)
  • 16. The computing system of claim 15, wherein a suite of memory operations that may be generated by a program includes at least one from a group comprising:
    reads and writes that may be blocked, forwarded, or responded to with an error code, depending on an extended state;
    extended reads and writes that override the state of a target location and allow complete access to the location without state interpretation;
    options on reads that will convert the state of a location to empty after an access;
    options on reads that will convert the state of a location to locked after an access;
    options on writes that change the contents of the target memory location only if the initial state was empty;
    atomic memory operations that perform a read-compute-write against a target memory location without allowing any other access to that location to occur during the sequence; and
    writes that expect to be targeting a memory location that is also a register in some frame, and that will awaken the thread associated with that frame if it is currently stalled on that register.
    (The second sketch following the claims illustrates how such reads might be dispatched.)
  • 17. The computing system of claim 16, wherein an instruction set for an LWP includes at least one from a group comprising instructions to:
    generate specialized memory requests;
    explicitly set the state of one of its corresponding registers without regard to its current state;
    test the state of one of its corresponding registers without blocking;
    designate that a memory frame at some address is now to be considered active;
    evict itself from the current pool of threads from the current LWP;
    evict all current frames from the current LWP;
    terminate its own existence as an active thread;
    place some other thread in a suspended state; and
    test an address associated with a pending memory request as recorded in a register to see if that address potentially matches some other address.
  • 18. The computing system of claim 11, wherein at least one LWP includes a pool of information defining one or more separate program threads, each of which has associated with it a frame comprising one or more unique registers that its program may manipulate (a frame layout is given in the third sketch following the claims).
  • 19. The computing system of claim 18, wherein at least one LWP includes logic adapted to decide which thread is to be allowed to execute an instruction from that thread's program, and also decide when it is appropriate to evict a frame and/or bring in a new frame corresponding to a different thread than any currently executing.
  • 20. The computing system of claim 11, wherein at least one LWP comprises an optional instruction cache to hold blocks of program text for the threads.
  • 21. The computing system of claim 18, wherein at least one LWP comprises logic adapted to decode instructions from a chosen frame, and determine which registers from the thread's frame are to be accessed.
  • 22. The computing system of claim 18, wherein at least one LWP includes a frame cache that contains one or more frames in support of one or more threads, and that may be accessed either to support an instruction being executed for an owning thread, or to receive response messages from previously issued memory requests.
  • 23. The computing system of claim 22, wherein at least one LWP includes logic adapted to access registers for a particular instruction, either from the frame cache or memory, and test their contents before further processing.
  • 24. The computing system of claim 18, wherein at least one LWP includes an execution pipeline capable of executing one or more instructions, from the same or different threads, generating memory requests as called for, testing for exceptions, and writing back results when available to the appropriate register in an appropriate frame.
  • 25. The computing system of claim 18, wherein each frame of registers is contained in a sequential block of known length that has a unique address in memory.
  • 26. The computing system of claim 22, wherein memory operations from any thread that target any register will be routed either to the frame cache holding the most recent copy of the frame, or to memory if the frame is not currently in a frame cache.
  • 27. The computing system of claim 17, wherein multiple memory operations may be issued concurrently and extended semantics within a register are provided to indicate status of an operation.
  • 28. The computing system of claim 27, wherein memory hazards among multiple memory operations may be avoided based upon the status of an operation.
  • 29. The computing system of claim 15, wherein memory locations may be in one of a plurality of the states.
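
For illustration only, the following minimal C sketch models the memory word recited in claims 13-15 (cf. FIG. 8 and FIG. 11). All type and enumerator names are hypothetical, chosen for this sketch rather than taken from the specification, and the 64-bit value field with a one-bit extension field is an assumed layout.

```c
#include <stdint.h>

/* One memory word: a value field plus an extension field (claim 13).
 * ext == 0 ("full"): value holds ordinary program data.
 * ext == 1 ("extended semantics"): value is interpreted by the
 * memory interface to control how accesses to this location behave. */
typedef struct {
    uint64_t value;
    uint8_t  ext;             /* 0 = full, 1 = extended semantics */
} xdword;

/* Hypothetical tags for the extended states enumerated in claim 15.
 * A real encoding would pack such a tag together with any payload
 * (e.g., a forwarding address) inside the value field. */
typedef enum {
    XS_UNINITIALIZED,   /* location has never been written              */
    XS_ERROR,           /* value carries an error code                  */
    XS_LOCKED,          /* locked from some type of access              */
    XS_EMPTY,           /* logically empty                              */
    XS_BLOCKED_EMPTY,   /* empty register; owning thread stalled on it  */
    XS_BLOCKED_PENDING, /* register awaiting a prior memory reply;
                           owning thread stalled until it arrives       */
    XS_FORWARD,         /* requests are forwarded to an address held
                           in the value field                           */
    XS_TRAP             /* any access starts a new handler thread       */
} xstate;
```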
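Building on the types of the previous sketch, the function below illustrates how a memory interface might dispatch the read variants of claim 16: an ordinary read of a full location, the read-and-empty option, and blocking, forwarding, or error replies driven by the extended state. The reply codes and the handle_read signature are assumptions of this sketch, not the patented logic, and for brevity the state tag occupies the whole value field.

```c
typedef enum { R_VALUE, R_ERROR, R_BLOCKED, R_FORWARDED } reply;

/* Dispatch a read against one location; empty_after selects the
 * claim 16 option that converts the location to empty after access. */
reply handle_read(xdword *loc, int empty_after, uint64_t *out)
{
    if (!loc->ext) {                      /* "full": ordinary data    */
        *out = loc->value;
        if (empty_after) {                /* read-and-empty option    */
            loc->ext   = 1;
            loc->value = XS_EMPTY;
        }
        return R_VALUE;
    }
    switch ((xstate)loc->value) {         /* extended semantics       */
    case XS_ERROR:   *out = loc->value; return R_ERROR;
    case XS_EMPTY:   return R_BLOCKED;    /* requester must wait      */
    case XS_FORWARD: return R_FORWARDED;  /* resend to stored address */
    default:         return R_ERROR;      /* remaining states treated
                                             as errors in this sketch */
    }
}
```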
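Finally, a sketch of a thread frame per claims 18 and 25: a fixed-length, sequentially addressed block of registers with a unique memory address, so that ordinary memory operations can target a register (claim 26) and a write can awaken a thread stalled on that register (the last item of claim 16). NREGS, frame_t, and wake_thread are hypothetical names, and the scheduler hook is stubbed out.

```c
#define NREGS 32   /* assumed frame length; fixed and known (claim 25) */

typedef struct {
    xdword   reg[NREGS];  /* registers double as memory locations      */
    uint64_t base_addr;   /* unique address of this frame in memory    */
} frame_t;

/* Scheduler hook; a real LWP would mark the owning thread runnable. */
static void wake_thread(frame_t *f) { (void)f; }

/* A write targeting a register: deliver the value, mark the word
 * "full", and wake the owning thread if it was stalled on this
 * register (claim 16, last item). */
void write_to_register(frame_t *f, int r, uint64_t v)
{
    int was_blocked = f->reg[r].ext &&
                      f->reg[r].value == XS_BLOCKED_EMPTY;
    f->reg[r].value = v;
    f->reg[r].ext   = 0;
    if (was_blocked)
        wake_thread(f);
}
```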
Provisional Applications (1)

Number      Date       Country
60/774,559  Feb. 2006  US