1. Field
This disclosure relates generally to data processing systems, and more specifically, to processor implementations and instruction techniques for representing a loop construct in machine code and executing same in a data processing system.
2. Related Art
Processor designs routinely provide instruction-level mechanisms that may be used to encode loop constructs. For example, many generations of processors have supported loop constructs in the conventional way, i.e., using a control transfer instruction at loop end (often as a conditional or otherwise predicated branch) to branch backward to a first instruction of the loop body code. While programming languages typically present syntactic features that a programmer (and a compiler) can use to identify both the beginning and the end of a loop in source code, there is typically no explicit coding in the stream of machine instructions actually fetched and executed by a processor for both loop beginning and loop end. Unlike the source forms familiar to programmers, machine or assembly language forms of loop code generated by compilers typically include machine instructions for any necessary pre-loop initialization and then drop directly into the sequence of machine instructions that constitute the loop body. As a result, entry into the loop is often unknown (and indeed unrecognizable) to instruction processing and fetch logic at least until the loop is closed by execution of the branch backward to the first instruction of loop body code.
Accordingly, for some computations and in some processor implementations, successive iterations through the loop may incur non-sequential instruction fetch overhead. Although non-deterministic mechanisms such as branch prediction may be suitable in some processor implementations, those mechanisms may not always be attractive, particularly when embedded, real-time applications are involved. In addition, for some computations and in some processor implementations, squandering coding space within an iteratively executed loop body to specify a backward branch target (e.g., using a full instruction pointer-width branch target) may exacerbate problems and even preclude use of otherwise attractive low overhead looping constructs.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
Instruction set techniques have been developed to explicitly identify the beginning of a loop body and to code a conditional loop-end in ways that allow a processor implementation to manage efficiently an instruction fetch buffer and/or entries in an instruction cache. In particular, for some computations and processor implementations, a machine instruction is defined that identifies a loop start, stores a corresponding loop start address on a return stack (or in other suitable storage) and directs fetch logic to take advantage of the identification by retaining in a fetch buffer or instruction cache the instruction(s) beginning at the loop start address, thereby avoiding usual branch delays on subsequent iterations of the loop. A conditional loop-end instruction can be used in conjunction with the loop start instruction to discard (or simply mark as no longer needed) the loop start address and the loop body instructions retained in the fetch buffer or instruction cache.
For some computations and/or processor implementations, such techniques can be employed to reduce non-sequential instruction fetch overhead in loops. In some embodiments and for certain classes of computations, the ability to deterministically achieve such reductions using techniques described herein may be attractive. In some embodiments, instruction techniques that separate specification of branch target address from invocation of a loop closing branch may also improve code density and can allow some computations to exploit better fixed- or limited-size architectural constructs such as fetch buffers.
For concreteness of description, we focus on certain illustrative loop-delimiting machine instructions that interact with a return address stack and fetch logic of an embedded-type processor that implements a switch on event multithreading (SOEMT) execution model. Of course, embodiments of the present invention are not limited to the illustrated machine instructions, to embedded-type processors or to any particular execution model, multithreaded or otherwise. For generality, the illustrated machine code implements an ordinary, unbounded loop that may nest (or be nested within) other loops or control constructs. However, based on the description herein, persons of ordinary skill in the art will appreciate applications of the invented techniques to other loops or control constructs.
In some embodiments, a BEGIN instruction causes the next instruction address (i.e., the address of the first instruction of a loop body) to be pushed onto a return address stack maintained for a particular thread. A subsequently encountered LOOP instruction tests a condition or predicate, whereupon for some results, the LOOP instruction copies the value at top of the return address stack to the program counter (thereby iterating), and for a loop exit case, pops/discards the top value from the return address stack. The BEGIN instruction also directs fetch logic to retain at least the first instruction of a loop body (e.g., in a fetch buffer that, in some embodiments, may be thread- or context-specific), while the LOOP instruction, in the loop exit case, releases the instruction(s) previously retained by the fetch logic. In some embodiments, further loop body instructions (possibly an entire loop body) may be retained and released based on execution of BEGIN and LOOP instructions.
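Purely for illustration, the BEGIN/LOOP semantics just described can be sketched in software. The class below is a minimal model, not a definitive implementation: the single-instruction displacement to the loop start, the set-based retention directive, and all names are assumptions drawn only from this description.

```python
class LoopMachine:
    """Illustrative model of BEGIN/LOOP semantics (assumed, simplified)."""

    def __init__(self):
        self.pc = 0               # program counter (instruction index)
        self.return_stack = []    # per-thread return address stack
        self.retained = set()     # addresses fetch logic is directed to retain

    def begin(self):
        """BEGIN: push next instruction address; direct fetch logic to retain it."""
        loop_start = self.pc + 1
        self.return_stack.append(loop_start)
        self.retained.add(loop_start)     # keep loop start in fetch buffer/cache
        self.pc += 1

    def loop(self, condition):
        """LOOP: iterate (copy top of stack to PC) or exit (pop and release)."""
        if condition:
            self.pc = self.return_stack[-1]   # iterate: address stays on stack
        else:
            start = self.return_stack.pop()   # exit: discard loop start address
            self.retained.discard(start)      # release retained instruction(s)
            self.pc += 1
```

Note how the iterating case copies rather than pops the loop start address, so no branch target ever appears in the loop body coding.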
Techniques described herein have broad applicability to other loop constructs and to other processor designs, but will be understood and appreciated by persons of ordinary skill in the art in the illustrated context of BEGIN and LOOP instructions and the utility of such instructions for generally deterministic loop performance on an illustrative embedded-type SOEMT processor. Accordingly, in view of the foregoing and without limitation on the range of instruction set designs; loop constructs; fetch buffer or instruction cache configurations; or underlying processor or system architectures that may be employed in embodiments of the present invention, we describe certain illustrative embodiments.
Processor(s) 12 are of any type in which looping and instruction fetch behaviors are supported based on execution of code that includes loop-delimiting instructions. Typically, implementations of processor(s) 12 include a fetch buffer or other facility (such as an instruction cache) for storing instructions to be executed by the processor(s), decoder and sequencing logic, one or more execution units, and register storage, together with suitable data, instruction and control paths. At any given time, consistent with a computation performed by processor(s) 12, units of program code (e.g., instructions) and data reside in memory(s) 16, cache(s) 14 and/or processor stores (such as the fetch buffer, registers, etc.). In general, any of a variety of hierarchies may be employed, including designs that separate or commingle instructions and data in memory or cache. In addition, although
Internal components of illustrated integrated circuit 20 are interconnected and interoperate using any suitable techniques. For simplicity, we illustrate interconnection amongst major functional blocks via a bus DBUS and separate dedicated pathways (e.g., busses) for transfer of data to/from a local data store 23 and for fetching instructions from a local control store 22. That said, persons of ordinary skill in the art will recognize that any of a variety of interconnection techniques and topologies may be employed. In general, integrated circuit 20 may interface with external components (e.g., a host processor or system), transmit/receive circuits, event sources, input output devices, etc., via external buses or using other suitable interfaces.
In the illustration of
At any given time, consistent with a computation performed, units of program code (e.g., instructions) reside in control store 22 and units of data reside in data store 23 and/or in stores provided within processor core 21 (such as context-specific fetch buffers, registers, etc.). In general, configuration of
Consistent with a wireless MAC protocol controller application, the embedded-type data processing system illustrated in
Design choices made in at least some processor and integrated circuit implementations may deemphasize or eliminate the use of priority interrupts more commonly employed in conventional general purpose processor designs and instead, treat real-time (exogenous and endogenous) conditions as events. For example, in some implementations, assertion of an (enabled) event activates a corresponding one of multiple execution contexts, where each such context has (or can be viewed as having) its own program counter, fetch buffer and a set of programmer-visible registers. Contexts then compete for execution cycles using prioritized, preemptive multithreading, sometimes called “Switch-On-Event MultiThreading” (SOEMT). In some implementations, context switching occurs under hardware control with zero overhead cycles.
Generally, an instruction that has been issued will complete its execution, even if a context switch occurs while that instruction is still in the execution pipeline. In an illustrative SOEMT processor implementation, once a context is activated, the activated code runs to completion (subject to delays due to preemption by higher-priority contexts). If another of the context's events is asserted while the context is active to handle a previous event, handling of the second event occurs immediately after the running event handler terminates.
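The event-driven dispatch behavior described above can be sketched, again purely as an illustrative assumption, with a priority queue: assertion of an enabled event activates its context, the highest-priority active context runs next, and re-assertion while a context is active simply queues another activation that is handled after the running handler terminates. Preemption mid-handler is not modeled here.

```python
import heapq

class SOEMTDispatcher:
    """Illustrative sketch of SOEMT-style event-to-context dispatch (assumed)."""

    def __init__(self):
        self.seq = 0
        self.ready = []   # min-heap of (priority, seq, ctx); lower value = higher priority

    def assert_event(self, priority, ctx):
        # Re-assertion while ctx is active queues another activation, handled
        # immediately after the currently running handler terminates.
        self.seq += 1
        heapq.heappush(self.ready, (priority, self.seq, ctx))

    def next_context(self):
        # Pick the highest-priority active context; its handler runs to completion.
        return heapq.heappop(self.ready)[2] if self.ready else None
```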
Note that return addresses and loop start addresses need not be commingled in a single stack-type structure. Indeed, in some embodiments, storage for the loop start address may be provided by a separate register or (to support nesting) by a group of registers. In general, storage (whether organized as a stack or in accord with some other data organization) may be implemented in dedicated, shared, allocable and/or context- or thread-specific hardware or as a similar, in-memory structure. In some embodiments, a logical stack may be represented partly in register storage and partly in memory. For example, a top element of the stack may be maintained in a hardware register, while software manages storage by (i) saving register contents into an appropriate location in memory when another return/loop address is to be pushed and (ii) restoring to the register from the memory after consumption of an address from top of stack. In some embodiments, a design that employs multiple context-specific instances of a hardware register stack in which return addresses and loop start addresses are commingled may be desirable as processor logic and data paths that exist to push a next instruction address onto a stack (in connection with a call instruction) may be used to implement the BEGIN instruction. Similarly, to support the LOOP instruction and iterative behaviors described herein, extensions to return logic can selectively allow an address to be copied (rather than popped) from top of stack via a data path to the PC store that may already exist to support call return-related control transfers. Likewise, a pop and discard option can allow the return logic to handle the loop exit case of the LOOP instruction as described herein. Based on the description and claims herein, persons of ordinary skill in the art will appreciate a wide variety of suitable implementations for loop start address storage for BEGIN and LOOP instruction operations.
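The hybrid register/memory organization mentioned above — a top-of-stack element maintained in a hardware register, with software spilling older entries to memory — can be sketched as follows. This is an assumed structure for illustration only; the class and method names are hypothetical.

```python
class HybridReturnStack:
    """Illustrative logical stack: top element in a 'hardware register',
    older entries spilled to a software-managed area in memory (assumed)."""

    def __init__(self):
        self.tos = None       # hardware top-of-stack register
        self.memory = []      # software-managed spill area in memory

    def push(self, addr):
        if self.tos is not None:
            self.memory.append(self.tos)  # (i) save register contents to memory
        self.tos = addr

    def pop(self):
        addr = self.tos
        # (ii) restore register from memory after consuming top of stack
        self.tos = self.memory.pop() if self.memory else None
        return addr

    def peek(self):
        return self.tos       # LOOP's iterating case copies without popping
```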
Note that, while the illustrated SOEMT-type processor core implementation provides hardware support for multiple active contexts including context-specific portions of fetch buffers 360 and registers 315, other embodiments need not implement the sample multithreaded execution model and/or may support differing allocations of fetch buffer and register resources. Nonetheless, in the interest of concreteness, we illustrate certain embodiments in view of structures and terminology appropriate to the illustrated SOEMT-type processor core. Based on the description herein, persons of ordinary skill in the art will appreciate variations and/or simplifications for other embodiments.
Referring to
Note that, while some instruction set codings explained herein with respect to certain illustrative embodiments tend to assume a single instruction position displacement between a BEGIN instruction and the loop start address, persons of ordinary skill in the art will appreciate that other displacements may be desirable or acceptable in other embodiments. In general, loop body coding density may be improved if the BEGIN instruction (or its analog) appears outside the loop body; nonetheless, in some embodiments it may be acceptable to code a loop delimiting instruction together with (or just following) an instruction or instruction grouping that begins the loop body.
Configurations and interconnection of memory controller 350, memory 357, host interface 340 and PHY 321 with SOEMT core 310 via the illustrated bus DBUS are purely illustrative. Indeed, based on the description herein, many variations will be appreciated by persons of ordinary skill in the art.
Turning now to an illustrative instruction set, techniques have been developed to identify explicitly the beginning of a loop body and to code a conditional loop closing branch in ways that allow a processor implementation to manage efficiently an instruction fetch buffer and/or entries in an instruction cache. In general, useful exploitations of these techniques can be embodied in an instruction set architecture and in concrete implementations thereof (e.g., as microprocessor integrated circuit implementations of a computation machine), as well as in computer readable encodings of program code that employ execution sequences of machine instructions that include loop-delimiting instructions of the type(s) described herein. In general, such program code may be prepared by machine instruction level programmers or generated by a compiler or other transformative method from iteration constructs that appear in a source level language or other precursor form.
By way of example, a first loop delimiting instruction (canonically a BEGIN instruction) is executable on a computational machine to identify a “loop start address” in program code, to store that loop start address on a return stack (or in other suitable storage of the computational machine) and to direct fetch logic to take advantage of the identification by retaining in a fetch buffer or instruction cache the instruction(s) beginning at the loop start address. Typically, the loop start address is the address of an instruction that immediately follows, or is located at a predetermined offset from, the BEGIN instruction. In this way, an execution sequence of machine instructions that includes a BEGIN instruction and an associated loop body can avoid instruction fetch delays that might otherwise be incurred if, upon iteration (and backward branch to the loop start address), the instruction at the loop start address has been displaced from the fetch buffer or instruction cache. For some loops, e.g., inner loop bodies consisting of short linear sequences (without branches) well within the capacity of a fetch buffer or instruction cache, non-sequential fetch overheads may not be a significant problem. However, more generally, for loops of larger size or for execution trajectories that for some other reason (such as loops or call/return sequences nested within a loop body, embedded multi-way branches and/or switch/case statements) may tend to displace the instruction appearing at the loop start address from a fetch buffer or instruction cache, the techniques described herein provide a useful mechanism for deterministically avoiding non-sequential fetch overheads that may be incurred on iteration. Note that large loop bodies may be coded by a programmer or may result from in-lining of code, e.g., by an optimizing compiler, for one or more called functions.
A second loop delimiting instruction (canonically a LOOP instruction) is executable on the computational machine to perform a specified condition test (or to test an appropriate condition code or predicate) and based on some results thereof (e.g., a true value) to iterate, while based on other results thereof (e.g., a false value) to exit. Of course, other codings and senses of loop continuation and loop exit conditions may be employed. More specifically, in the iterating case, the LOOP instruction initiates another pass through loop body code by copying to the program counter the loop start address which was previously stored on the return stack (or in some other suitable storage) by a corresponding execution of the BEGIN instruction. In this way, the LOOP instruction need not expend coding space in the loop body to encode a branch target address or offset field. Furthermore, since the loop start address is not coded in the machine code itself, even a full instruction pointer width address can be used (without adversely affecting code density), and the extent of loop body code need not be limited, except by the address space of the computational machine. Such flexibility is in contrast with constraints typical of conventional instruction set approaches that seek to encode an address target in a small (e.g., 8-bit) offset field of a conditional branch instruction coding. In the exiting case, the LOOP instruction discards the previously stored loop start address and indicates to fetch logic that the instruction(s) previously retained beginning at the loop start address need not be retained in the fetch buffer or instruction cache for near-term re-execution.
In general, test and branch logic consistent with the above-described operation of the LOOP instruction may be implemented in any appropriate place (including e.g., in a branch unit of ALU(s) 316, see
The illustration of
In a conventional RISC instruction set implementation, the loop closing branch requires 32 bits. Even in a high-code density RISC implementation such as THUMB, MIPS-16, or Tensilica, the loop closing branch can require 16 or 24 bits (with 8-16 bits allocated to specification of the branch target). In addition, because backward branch 614 conditionally breaks an address-sequential program sequence, in those cases where the extent of loop body 612 exceeds capacity of a fetch buffer or execution of the loop body overwrites contents of relevant lines of the processor's instruction cache, a processor executing the code of
In contrast,
After machine instructions that implement the source-level loop body described with reference to
Although the code snippet of
Note that, in contrast with the loop closing branch and in-loop-body coding of branch target illustrated in
For an SOEMT processor implementation that employs the techniques described herein, advantages can be significant. For example, in a network or communications controller implementation, tighter loops and reduced fetch latencies can allow a higher symbol rate to operating frequency ratio. Accordingly, in some designs, it is possible to achieve a target symbol rate at lower operating frequency and with lower power consumption. Conversely, in some designs, it can be possible to achieve higher symbol rates at a given operating frequency and/or power budget.
In the illustrated instruction sequence, execution (781) of a first loop delimiting instruction (BEGIN instruction 623) pushes (773) an instruction pointer onto a return stack R which is represented (at least partially) in storage provided by context registers 715. Note that, in general, return stack R may be implemented in hardware, e.g., as a physical register or memory stack, or may be implemented in software with a top-of-stack register maintained in hardware and push/pop activity performed by software. Nonetheless, for simplicity and clarity of illustration, physical register storage is presumed. The instruction pointer identifies the first instruction of loop body 624 (here, LD instruction 626) and serves as the loop closing branch target for a subsequently executed instance of the LOOP instruction. Execution (781) of BEGIN instruction 623 also directs (772) fetch logic 311 to maintain the identified first instruction of loop body 624 in a fetch buffer 771. In the illustrated embodiment, fetch buffer 771 is associated with currently executing context 701, although other multi-threaded processor embodiments may share a fetch buffer or instruction cache amongst executing contexts, in which case, directive (772) would apply to the shared fetch buffer or instruction cache. Note that, depending on the implementation, such a directive may cover (774) LD instruction 626 itself or a fetch group of instructions that includes LD instruction 626 or may extend to a larger set of instructions (or fetch groups) that may (in some cases) cover the entirety of loop body 624.
Execution (782) of a corresponding instance of the second loop delimiting instruction (LOOP instruction 625) determines, based on a condition code established by the compare instruction (CMPI 10) that precedes it, whether the execution sequence branches backward to the first instruction of loop body 624 or falls through to the instruction that follows. Note that, more generally, any of a variety of predicates, values and/or condition codes may be evaluated in the course of executing a LOOP-type instruction and semantics of any LOOP instruction (or variant thereof) are implementation dependent.
In the first case (iteration/backward branch), LOOP instruction 625 copies (775) to program counter 716 the instruction pointer corresponding to the first instruction of loop body 624 (here, LD instruction 626), which was previously pushed (773) onto a return stack R (in storage provided by context registers 715) by execution (781) of BEGIN instruction 623. Note that, in some embodiments, return address pointers are also pushed onto return stack R in connection with CALL-type instructions and then popped and used to update program counter 716 in accordance with a RETURN-type instruction. Indeed, in some embodiments, loop start addresses and return addresses are commingled in return stack R and some shared resources are used to support execution of the BEGIN and CALL instructions (on the one hand) and LOOP and RETURN instructions on the other hand. However, unlike a RETURN instruction, iterating executions of LOOP instruction 625 leave the previously pushed (773) loop start address at the top of return stack R for potential reuse in a subsequent iteration.
In the second case (loop exit/fall through), execution (782) of LOOP instruction 625 pops (776) the previously pushed (773) loop start address from the top of return stack R and the value in program counter 716 increments normally, allowing the execution sequence to exit loop body 624. Finally, in the second case (loop exit/fall through), execution (782) of LOOP instruction 625 rescinds (777) the prior directive (772) that fetch logic 311 maintain the identified first instruction of loop body 624 in a fetch buffer 771.
Although the illustration of
While the illustration of
In general, techniques in accordance with the present invention can allow arbitrary levels of nesting of return addresses. In actual practice, however, there may not be a corresponding ability to retain an arbitrary number of loop start instructions in a fetch buffer. Even given such practical constraints, one suitable strategy for nested BEGIN/LOOP constructs is to retain only the N (where N>=1) most recent loop start instructions, based on the fact that these belong to the innermost loops, which (necessarily) need their loop start instruction addresses more frequently than the enclosing loops. Although the consequence of discarding a retained loop start instruction address for an outer loop is the extra time to perform the non-sequential instruction fetch, the loop executes properly whether or not the initial instruction is retained.
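The retain-only-the-N-most-recent strategy for nested loops can be sketched, under stated assumptions, with an ordered map used as a small bounded set. The class and method names are hypothetical; only the eviction policy (oldest, i.e., outermost, loop start is discarded first, and discarding affects only fetch latency, not correctness) is drawn from the description above.

```python
from collections import OrderedDict

class RetentionPolicy:
    """Illustrative fetch-buffer retention policy for nested BEGIN/LOOP:
    retain only the N most recently begun loops' start instructions."""

    def __init__(self, n=1):
        assert n >= 1
        self.n = n
        self.retained = OrderedDict()   # loop start address -> retained

    def on_begin(self, loop_start):
        self.retained[loop_start] = True
        self.retained.move_to_end(loop_start)   # most recent = innermost
        while len(self.retained) > self.n:
            # Evict the oldest (outermost) loop start; that loop still
            # executes properly, merely paying a non-sequential fetch
            # on each iteration.
            self.retained.popitem(last=False)

    def on_loop_exit(self, loop_start):
        self.retained.pop(loop_start, None)     # release on loop exit

    def is_retained(self, loop_start):
        return loop_start in self.retained
```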
Although the invention is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. For example, while techniques have been described that allow certain embedded-type processor implementations to limit non-deterministic fetch overheads that may otherwise be incurred in some iterations of a loop construct, the described techniques have broad applicability to a variety of processor types and implementations. Similarly, although the described techniques may be employed to facilitate high density codings of machine code and thereby support high symbol rate to operating frequency ratios desirable for communications processors, the techniques are not limited thereto.
Embodiments of the present invention may be implemented using any of a variety of different information processing systems. Accordingly, while
Articles, systems and apparatus that implement the present invention are, for the most part, composed of electronic components, circuits and/or code (e.g., software, firmware and/or microcode) known to those skilled in the art and functionally described herein. Accordingly, component, circuit and code details are explained at a level of detail necessary for clarity, for concreteness and to facilitate an understanding and appreciation of the underlying concepts of the present invention. In some cases, a generalized description of features, structures, components or implementation techniques known in the art is used so as to avoid obfuscation or distraction from the teachings of the present invention.
In general, the terms “program” and/or “program code” are used herein to describe a sequence or set of instructions designed for execution on a computer system. Such terms may include or encompass subroutines, functions, procedures, object methods, implementations of software methods, interfaces or objects, executable applications, applets, servlets, source, object or intermediate code, shared and/or dynamically loaded/linked libraries and/or other sequences or groups of instructions designed for execution on a computer system.
In some embodiments of the present invention, a computer program product is embodied in at least one computer readable medium and includes program code executable on a processor, wherein the program code includes a loop construct encoded using delimiting BEGIN- and LOOP-type instructions. All or some of the program code described herein, as well as any software implemented functionality of information processing systems described herein, may be accessed or received by elements of an information processing system, for example, from computer readable media or via other systems. In general, computer readable media may be permanently, removably or remotely coupled to an information processing system. Computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; nonvolatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; MRAM; volatile storage media including registers, buffers or caches, main memory, RAM, etc.; and media incident to data transmission including transmissions via computer networks, point-to-point telecommunication equipment, and carrier waves or signals, just to name a few.
Finally, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and consistent with the description herein, a broad range of variations, modifications and extensions are envisioned. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.