1. Field of the Invention
The field of the invention is generally related to design structures, and more specifically to design structures for administering an access conflict in a computer memory cache.
2. Description of Related Art
Computer memory caches are organized in ‘cache lines,’ segments of memory typically sized to match the transfers used to read from and write to main memory. The superscalar computer processors in contemporary usage implement multiple execution units for multiple processing pipelines executing microinstructions in microcode, thereby making possible simultaneous access by two different pipelines of execution to exactly the same memory cache line. The size of the cache lines is larger than the size of typical reads and writes from a superscalar computer processor to and from memory. If, for example, a processor reads and writes memory in units of bytes, words (two bytes), double words (four bytes), and quad words (eight bytes), the processor's cache lines may be sized as eight bytes (64 bits) or sixteen bytes (128 bits)—so that all reads and writes between the processor and the cache will fit into one cache line. In such a system, however, a store microinstruction and a load microinstruction, neither of which accesses the same memory location as the other, can nevertheless both access the same cache line—because the memory locations addressed, although different, are both within the same cache line. This pattern of events is referred to as an access conflict in a computer memory cache.
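The conflict described above can be sketched in a few lines of code; the 16-byte line size and the addresses are hypothetical, chosen only for illustration:

```python
# Sketch: two distinct byte addresses can fall within one cache line.
# The line size and addresses below are illustrative assumptions.

LINE_SIZE = 16  # bytes per cache line (assumed)

def cache_line_index(address):
    """Return the index of the cache line a byte address falls in."""
    return address // LINE_SIZE

store_address = 0x1004  # a store writes a word here
load_address = 0x100C   # a load reads a quad word here

# The addresses differ, so there is no data dependency...
assert store_address != load_address
# ...yet both fall within the same cache line: an access conflict.
assert cache_line_index(store_address) == cache_line_index(load_address)
```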
In a typical memory cache, the read and write electronics each require exclusive access to each cache line when writing or reading data to or from the cache line—so that a simultaneous read and write to the same cache line cannot be conducted on the same clock cycle. This means that when an access conflict exists, either the load microinstruction or the store microinstruction must be delayed or ‘stalled.’ Prior art methods of administering access conflicts allow the store microinstruction to be stalled to a subsequent clock cycle while the load microinstruction proceeds to execute as scheduled on a current clock cycle. Such a priority scheme impacts performance because subsequent stores cannot be retired before a previously stalled store microinstruction completes—because stores are always completed by processor execution units in order—and this implementation increases the probability of stalled stores. Routinely allowing stalled stores therefore risks considerable additional disruption of processing pipelines in contemporary computer processors.
Methods and apparatus are disclosed for administering an access conflict in a computer memory cache so that a conflicting store microinstruction is always given priority over a corresponding load microinstruction—thereby eliminating the risk of stalling subsequent store microinstructions. More particularly, methods and apparatus are disclosed for administering an access conflict in a computer memory cache that include receiving in a memory cache controller a write address and write data from a store memory instruction execution unit of a superscalar computer processor and a read address for read data from a load memory instruction execution unit of the superscalar computer processor, for the write data to be written to and the read data to be read from a same cache line in the computer memory cache simultaneously on a current clock cycle; storing by the memory cache controller the write data in the same cache line on the current clock cycle; stalling, by the memory cache controller in the load memory instruction execution unit, a corresponding load microinstruction; and reading by the memory cache controller from the computer memory cache on a subsequent clock cycle read data from the read address.
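The priority scheme just described can be modeled in software as follows. This is a minimal behavioral sketch, not the disclosed circuit: the class name, the trace format, and the 16-byte line size are all assumptions made for illustration.

```python
# Sketch: on a same-line conflict the store completes on the current
# clock cycle and the load is pushed to the next cycle. Names and
# sizes are illustrative.

LINE_SIZE = 16  # assumed cache line size in bytes

class CacheControllerModel:
    def __init__(self):
        self.cycle = 0
        self.trace = []  # (cycle, operation, address)

    def access(self, write_addr, read_addr):
        """Model one clock cycle with a simultaneous store and load."""
        same_line = (write_addr // LINE_SIZE) == (read_addr // LINE_SIZE)
        # The store is always given priority: it completes this cycle.
        self.trace.append((self.cycle, "store", write_addr))
        # The load completes this cycle unless it conflicts with the store.
        load_cycle = self.cycle + 1 if same_line else self.cycle
        self.trace.append((load_cycle, "load", read_addr))
        self.cycle += 1

model = CacheControllerModel()
model.access(0x1004, 0x100C)  # same 16-byte line: the load slips a cycle
model.access(0x2000, 0x3000)  # different lines: both complete together
```

Because the store is never the instruction that stalls, subsequent stores can always retire in order without waiting on a stalled predecessor.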
In one embodiment, a design structure embodied in a machine readable storage medium for at least one of designing, manufacturing, and testing a design is provided. The design structure includes an apparatus for administering an access conflict in a computer memory cache. The apparatus includes the computer memory cache, a computer memory cache controller, and a superscalar computer processor. The computer memory cache is operatively coupled to the superscalar computer processor through the computer memory cache controller. The apparatus is configured to be capable of receiving in the memory cache controller a write address and write data from a store memory instruction execution unit of the superscalar computer processor and a read address for read data from a load memory instruction execution unit of the superscalar computer processor, for the write data to be written to and the read data to be read from a same cache line in the computer memory cache simultaneously on a current clock cycle, storing by the memory cache controller the write data in the same cache line on the current clock cycle, stalling, by the memory cache controller in the load memory instruction execution unit, a corresponding load microinstruction, and reading by the memory cache controller from the computer memory cache on a subsequent clock cycle read data from the read address.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular descriptions of exemplary embodiments of the invention as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts of exemplary embodiments of the invention.
Exemplary methods, systems, and products for administering an access conflict in a computer memory cache according to embodiments of the present invention are described with reference to the accompanying drawings, beginning with
The processor (156) is a superscalar processor that includes more than one execution unit (100, 102). A superscalar processor is a computer processor that includes multiple execution units to allow the processing in multiple pipelines of more than one instruction at a time. A pipeline is a set of data processing elements connected in series within a processor, so that the output of one processing element is the input of the next one. Each element in such a series of elements is referred to as a ‘stage,’ so that pipelines are characterized by a particular number of stages: a three-stage pipeline, a four-stage pipeline, and so on. All pipelines have at least two stages, and some pipelines have more than a dozen stages. The processing elements that make up the stages of a pipeline are the logical circuits that implement the various stages of an instruction (address decoding and arithmetic, register fetching, cache lookup, and so on). Implementation of a pipeline allows a processor to operate more efficiently because a computer program instruction can execute simultaneously with other computer program instructions, one in each stage of the pipeline at the same time. Thus a five-stage pipeline can have five computer program instructions executing in the pipeline at the same time, one being fetched from a register, one being decoded, one in execution in an execution unit, one retrieving additional required data from memory, and one having its results written back to a register, all at the same time on the same clock cycle.
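The five-stage example can be sketched as follows; the stage names used are the classic textbook ones, assumed here only for illustration:

```python
# Sketch: occupancy of a five-stage pipeline. On a full cycle, five
# instructions are in flight, one per stage. Stage names are assumed.

STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def pipeline_occupancy(instructions, cycle):
    """Map each busy stage to the instruction occupying it on a cycle."""
    occupancy = {}
    for stage_index, stage in enumerate(STAGES):
        instr_index = cycle - stage_index  # instruction i enters fetch on cycle i
        if 0 <= instr_index < len(instructions):
            occupancy[stage] = instructions[instr_index]
    return occupancy

instrs = ["i0", "i1", "i2", "i3", "i4"]
# On cycle 4 every stage is busy: i4 is fetched while i0 writes back.
assert pipeline_occupancy(instrs, 4) == {
    "fetch": "i4", "decode": "i3", "execute": "i2",
    "memory": "i1", "writeback": "i0",
}
```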
The superscalar processor (156) is driven by a clock (not shown). The processor is made up of internal networks of static and dynamic logic: gates, latches, flip flops, and registers. When the clock arrives, dynamic elements (latches, flip flops, and registers) take their new value and the static logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the dynamic elements again take their new values, and so on. By breaking the static logic into smaller pieces and inserting dynamic elements between the pieces of static logic, the delay before the logic gives valid outputs is reduced, which means that the clock period can be reduced—and the processor can run faster.
The superscalar processor (156) can be viewed as providing a form of “internal multiprocessing,” because multiple execution units can operate in parallel inside the processor on more than one instruction at the same time. Many modern processors are superscalar; some have more parallel execution units than others. An execution unit is a module of static and dynamic logic within the processor that is capable of executing a particular class of instructions, memory I/O, integer arithmetic, Boolean logical operations, floating point arithmetic, and so on. In a superscalar processor, there is more than one execution unit of the same type, along with additional circuitry to dispatch instructions to the execution units. For instance, most superscalar designs include more than one integer arithmetic/logic unit (‘ALU’). The dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to the two units.
The computer of
Main memory is organized in ‘pages.’ A cache frame is a portion of cache memory sized to accommodate a memory page. Each cache frame is further organized into memory segments each of which is called a ‘cache line.’ Cache lines may vary in size, for example, from 8 to 512 bytes. The size of the cache line typically is designed to be larger than the size of the usual access requested by a program instruction, which ranges from 1 to 16 bytes: a byte, a word, a double word, and so on.
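The page/frame/line organization can be sketched as simple address arithmetic; the 4096-byte page and 64-byte line sizes below are assumptions for illustration only:

```python
# Sketch: decomposing a byte address into page, cache line within the
# page (and hence within the cache frame), and byte offset within the
# line. Page and line sizes are illustrative assumptions.

PAGE_SIZE = 4096  # bytes per memory page (assumed)
LINE_SIZE = 64    # bytes per cache line (assumed)
LINES_PER_FRAME = PAGE_SIZE // LINE_SIZE  # 64 lines in each cache frame

def decompose(address):
    """Split a byte address into (page, line-in-frame, offset-in-line)."""
    page = address // PAGE_SIZE
    line = (address % PAGE_SIZE) // LINE_SIZE
    offset = address % LINE_SIZE
    return page, line, offset

# Address 0x1234 lands in page 1, line 8 of that frame, byte 52 of the line.
assert decompose(0x1234) == (1, 8, 52)
```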
The computer in the example of
The actual stores and loads of data to and from the cache are carried out by the cache controller (104). In this example, the cache controller (104) has separate interconnections (103, 105) respectively to a load memory instruction execution unit (100) and a store memory instruction execution unit (102), and the cache controller (104) is capable of accepting simultaneously from the execution units in the processor (156) both a store instruction and a load instruction at the same time. The cache controller (104) also has separate interconnections (107, 109) with the computer memory cache (108) for loading and storing data in the cache, and the cache controller (104) is capable of simultaneously, on the same clock cycle, both storing data in the cache and loading data from the cache—so long as the data to be loaded and the data to be stored are in separate cache lines within the cache.
In the example of
If, as here, there is an access conflict, with the read and the write directed to the same cache line at the same time, the memory cache controller will stall a processor operation of some kind in order to allow either the read or the write to occur on a subsequent clock cycle. In this example, the memory cache controller (104) is configured to store the write data in the same cache line on the current clock cycle; stall the corresponding load microinstruction in the load memory instruction execution unit (100); and read the read data from the read address in the computer memory cache (108) on a subsequent clock cycle. The corresponding load microinstruction is ‘corresponding’ in the sense that it is the load microinstruction that caused the read address to be presented to the cache memory controller at the same time as the write address directed to the same cache line.
In the example computer of
Computer (152) of
Computer (152) of
The example voice server of
The exemplary computer (152) of
The example multimodal device of
For further explanation,
The processor (156) includes a decode engine (122), a dispatch engine (124), an execution engine (140), and a writeback engine (155). Each of these engines is a network of static and dynamic logic within the processor (156) that carries out particular functions for pipelining program instructions internally within the processor. The decode engine (122) retrieves machine code instructions from registers in the register set and decodes the machine instructions into microinstructions. The dispatch engine (124) dispatches microinstructions to execution units in the execution engine. Execution units in the execution engine (140) execute microinstructions, and the writeback engine (155) writes the results of execution back into the correct registers in the register file (126).
The processor (156) includes a decode engine (122) that reads a user-level computer program instruction and decodes that instruction into one or more microinstructions for insertion into a microinstruction queue (110). Just as a single high level language instruction is compiled and assembled to a series of machine instructions (load, store, shift, etc.), each machine instruction is in turn implemented by a series of microinstructions. Such a series of microinstructions is sometimes called a ‘microprogram’ or ‘microcode.’ The microinstructions are sometimes referred to as ‘micro-operations,’ ‘micro-ops,’ or ‘μops’—although in this specification, a microinstruction is usually referred to as a ‘microinstruction.’
Microprograms are carefully designed and optimized for the fastest possible execution, since a slow microprogram would yield a slow machine instruction which would in turn cause all programs using that instruction to be slow. Microinstructions, for example, may specify such fundamental operations as the following:
For a further example: A typical assembly language instruction to add two numbers, such as, for example, ADD A, B, C, may add the values found in memory locations A and B and then put the result in memory location C. In processor (156), the decode engine (122) may break this user-level instruction into a series of microinstructions similar to:
It is these microinstructions that are then placed in the microinstruction queue (110) to be dispatched to execution units.
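Such an expansion can be sketched with a tiny interpreter. The micro-op names, the register names, and the memory model below are hypothetical, offered only to illustrate how one machine instruction becomes a series of microinstructions:

```python
# Sketch: a plausible microprogram for ADD A, B, C, executed by a
# minimal interpreter. All names and the memory model are hypothetical.

memory = {"A": 2, "B": 3, "C": 0}
registers = {"r0": 0, "r1": 0}

microprogram = [
    ("load", "r0", "A"),   # read memory location A into register r0
    ("load", "r1", "B"),   # read memory location B into register r1
    ("add",  "r0", "r1"),  # r0 <- r0 + r1
    ("store", "C", "r0"),  # write the sum to memory location C
]

for op, dst, src in microprogram:
    if op == "load":
        registers[dst] = memory[src]
    elif op == "add":
        registers[dst] += registers[src]
    elif op == "store":
        memory[dst] = registers[src]

assert memory["C"] == 5  # ADD A, B, C: 2 + 3 stored at C
```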
Processor (156) also includes a dispatch engine (124) that carries out the work of dispatching individual microinstructions from the microinstruction queue to execution units. The processor (156) includes an execution engine that in turn includes several execution units, two load memory instruction execution units (130, 100), two store memory instruction execution units (132, 102), two ALUs (134, 136), and a floating point execution unit (138). The microinstruction queue in this example includes a first store microinstruction (112), a corresponding load microinstruction (114), and a second store microinstruction (116). The load instruction (114) is said to correspond to the first store instruction (112) because the dispatch engine (124) dispatches both the first store instruction (112) and its corresponding load instruction (114) into the execution engine (140) at the same time, on the same clock cycle. The dispatch engine can do so because the execution engine supports two pipelines of execution, so that two microinstructions can move through the execution portion of the pipelines at exactly the same time.
In this example, the dispatch engine (124) detects no dependency between the first store microinstruction (112) and the corresponding load microinstruction (114), despite the fact that both instructions address memory in the same cache line, because the memory locations addressed are not the same. The memory addresses are in the same cache line, but that fact is unknown to the dispatch engine (124). As far as the dispatch engine is concerned, the load microinstruction (114) is to read data from a memory address that is different from the memory address to which the first store instruction (112) is to write data. From the point of view of the dispatch engine, therefore, there is no reason not to allow the first store microinstruction and the corresponding load microinstruction to execute at the same time. From the point of view of the dispatch engine, there is no reason to require the load microinstruction to wait for completion of the first store microinstruction.
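The two points of view can be contrasted directly. In this sketch the addresses and line size are hypothetical; the dependency check models only what the dispatch engine can see, full memory addresses, while the cache-line relationship is visible only to the cache controller:

```python
# Sketch: dispatch-engine view versus cache-controller view of the
# same pair of addresses. Addresses and line size are illustrative.

LINE_SIZE = 16

store_addr = 0x2008  # address the first store microinstruction writes
load_addr = 0x200C   # address the corresponding load microinstruction reads

# Dispatch-engine view: the addresses differ, so no dependency is
# detected and both microinstructions may issue on the same cycle.
no_dependency = store_addr != load_addr

# Cache-controller view: the same two addresses share one cache line.
same_cache_line = (store_addr // LINE_SIZE) == (load_addr // LINE_SIZE)

assert no_dependency and same_cache_line  # the access conflict in a nutshell
```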
The example apparatus of
In this example, the memory cache (108) is shown with only two frames: frame 0 and frame 1. The use of two frames in this example is only for ease of explanation. As a practical matter, such a memory cache may include any number of associative frame ways as may occur to those of skill in the art. In apparatus where the computer memory cache is configured as a set associative cache memory having a capacity of more than one frame of memory, then the fact that write data is to be written to and read data to be read from a same cache line in the computer memory cache means that the write data are to be written to and the read data are to be read from the same cache line in the same frame in the computer memory cache.
In the example of
The address comparison circuit (148) compares the write address and the read address to determine whether the two addresses access the same cache line. A determination that the two addresses access the same cache line is a determination, by the address comparison circuitry of the computer memory cache controller, that the write data are to be written to and the read data are to be read from the same cache line. If the two addresses access the same cache line, as they do in this example, then the address comparison circuit signals the load memory instruction execution unit in which the load microinstruction is dispatched, by use of the stall output line (150), to stall the corresponding load microinstruction. That is, stalling the corresponding load microinstruction is carried out by signaling, by the address comparison circuit (148) through the stall output (150), the load memory instruction execution unit to stall the corresponding load microinstruction.
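The comparison itself reduces to discarding the line-offset bits of each address and comparing what remains. A sketch, assuming a hypothetical 64-byte cache line (six offset bits):

```python
# Sketch: the address comparison that drives the stall output.
# Two addresses share a cache line exactly when their addresses agree
# above the line-offset bits. The 64-byte line size is assumed.

LINE_OFFSET_BITS = 6  # log2(64): offset bits within an assumed 64-byte line

def stall_signal(write_addr, read_addr):
    """True when the write and read addresses fall in the same cache line."""
    return (write_addr >> LINE_OFFSET_BITS) == (read_addr >> LINE_OFFSET_BITS)

assert stall_signal(0x1010, 0x1038)      # same 64-byte line: stall the load
assert not stall_signal(0x1010, 0x1080)  # different lines: no stall
```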
Stalling the corresponding load microinstruction typically delays execution of the corresponding load microinstruction (as well as all microinstructions pipelined behind the corresponding load microinstruction) for one processor clock cycle. So stalling the corresponding load microinstruction allows the execution engine to execute the second store microinstruction (116) after executing the first store microinstruction (112) while stalling the corresponding load microinstruction (114) without stalling the second store microinstruction (116). That is, although the corresponding load microinstruction suffers a stall, neither the first store microinstruction nor the second store microinstruction suffers a stall. The store microinstructions execute on immediately consecutive clock cycles, just as they would have done if the corresponding load microinstruction had not stalled.
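The resulting schedule can be traced cycle by cycle. The trace below is an illustrative reading of the behavior described above, with cycle numbers assumed for the sketch:

```python
# Sketch: cycle-by-cycle trace of the priority scheme. The conflicting
# store and load are issued together; the store completes immediately,
# the load slips one cycle, and the second store is not delayed.

schedule = [
    (0, "first store executes"),               # the store wins the conflict
    (1, "load executes (stalled one cycle)"),  # delayed exactly one cycle
    (1, "second store executes"),              # runs in the parallel pipeline
]

store_cycles = [cycle for cycle, event in schedule if "store" in event]
# The stores run on immediately consecutive cycles, as if no stall occurred.
assert store_cycles == [0, 1]
```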
For further explanation,
In the example of
In the example of
For further explanation,
In the method of
The method of
The method of
The method of
The method of
The method of
In the method of
For further explanation,
Although processor design does not necessarily require that each pipeline stage be executed in one processor clock cycle, it is assumed here, for ease of explanation, that each of the pipeline stages in the example of
In this example, therefore, the cache controller stalls the corresponding load microinstruction (420, 411) at time t1. Stalling the corresponding load microinstruction delays execution of the corresponding load microinstruction (410) for one processor clock cycle. The corresponding load microinstruction (410) now executes (422) at time t2. Stalling the corresponding load microinstruction allows the execution engine to execute (418) the second store microinstruction (412) immediately after executing the first store microinstruction (408) while stalling the corresponding load microinstruction (410) without stalling the second store microinstruction (412). That is, although the corresponding load microinstruction (410) suffers a stall, neither the first store microinstruction (408) nor the second store microinstruction (412) suffers a stall. The store microinstructions (408, 412) were dispatched for execution on the immediately consecutive clock cycles, t0 and t2, and the store microinstructions execute on the immediately consecutive clock cycles, t0 and t2, just as they would have done if the corresponding load microinstruction (410) had not stalled.
Design process (610) may include using a variety of inputs; for example, inputs from library elements (630) which may house a set of commonly used elements, circuits, and devices, including models, layouts, and symbolic representations, for a given manufacturing technology (e.g., different technology nodes, 32 nm, 45 nm, 90 nm, etc.), design specifications (640), characterization data (650), verification data (660), design rules (670), and test data files (685) (which may include test patterns and other testing information). Design process (610) may further include, for example, standard circuit design processes such as timing analysis, verification, design rule checking, place and route operations, etc. One of ordinary skill in the art of integrated circuit design can appreciate the extent of possible electronic design automation tools and applications used in design process (610) without deviating from the scope and spirit of the invention. The design structure of the invention is not limited to any specific design flow.
Design process (610) preferably translates a circuit as described above and shown in
Exemplary embodiments of the present invention are described largely in the context of a fully functional computer system for administering an access conflict in a computer memory cache. Readers of skill in the art will recognize, however, that the present invention also may be embodied in a computer program product disposed on signal bearing media for use with any suitable data processing system. Such signal bearing media may be transmission media or recordable media for machine-readable information, including magnetic media, optical media, or other suitable media. Examples of recordable media include magnetic disks in hard drives or diskettes, compact disks for optical drives, magnetic tape, and others as will occur to those of skill in the art. Examples of transmission media include telephone networks for voice communications and digital data communications networks such as, for example, Ethernets™ and networks that communicate with the Internet Protocol and the World Wide Web. Persons skilled in the art will immediately recognize that any computer system having suitable programming means will be capable of executing the steps of the method of the invention as embodied in a program product. Persons skilled in the art will recognize immediately that, although some of the exemplary embodiments described in this specification are oriented to software installed and executing on computer hardware, nevertheless, alternative embodiments implemented as firmware or as hardware are well within the scope of the present invention.
It will be understood from the foregoing description that modifications and changes may be made in various embodiments of the present invention without departing from its true spirit. The descriptions in this specification are for purposes of illustration only and are not to be construed in a limiting sense. The scope of the present invention is limited only by the language of the following claims.
This application is a continuation-in-part of co-pending U.S. patent application Ser. No. 11/536,798, filed Sep. 29, 2006, which is herein incorporated by reference.
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 11536798 | Sep 2006 | US |
| Child | 12105806 | | US |