1. Field of the Invention
The invention is generally related to systems and methods for performing one or more operations on one or more elements using a multiple data processing element processor.
2. Related Art
Multiple data processing element processors, e.g., single instruction multiple data (SIMD) or multiple instruction multiple data (MIMD) processors, receive multiple data inputs, operate on the inputs, and output the results of the operation to, for instance, an output register. As an example, such a processor might receive inputs a, b, c, and d and add them in pairs to produce the results a+b and c+d. Occasionally, performing the prescribed operation on one or more of the data inputs is problematic for the processor and it generates an exception. This happens, for instance, when the prescribed operation is not implemented in the processor for the inputs provided. In such a scenario, the processor is unable to perform the operation and generates an exception.
When an exception occurs, typically no results are written to the output register and the exception is handled by an exception handler using software emulation, for instance, to perform the operation on the data inputs or to deal with the exception in some other way. The problem with this method is that it can be slow and resource intensive. Furthermore, in many instances only a few of the multiple data inputs cause an exception when the operation is performed; the majority of the data inputs do not cause an exception when the operation is performed. However, the processing of an exception typically also delays the processing of data that is not associated with the exception as the exception handler cannot discern which data inputs are the cause of the exception.
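By way of illustration only, the conventional all-or-nothing behavior described above can be sketched in Python. This is a minimal model and not the behavior of any particular processor: the names `LaneException`, `hw_divide`, `software_handler`, and `conventional_simd` are hypothetical, and integer division by zero is assumed as a stand-in for an operation the hardware cannot perform.

```python
class LaneException(Exception):
    """Hypothetical stand-in for a hardware exception raised by one element."""


def hw_divide(a, b):
    # Assumption: a zero divisor models an input the hardware cannot handle.
    if b == 0:
        raise LaneException("divide by zero")
    return a // b


def software_handler(xs, ys):
    # Slow path: the handler emulates the operation for every element,
    # because it cannot tell which element caused the exception.
    # None marks an element the handler dealt with "in some other way".
    return [a // b if b != 0 else None for a, b in zip(xs, ys)]


def conventional_simd(xs, ys):
    """Return (results, number_of_elements_sent_to_the_handler)."""
    try:
        return [hw_divide(a, b) for a, b in zip(xs, ys)], 0
    except LaneException:
        # No results were written; the handler redoes all of the elements.
        return software_handler(xs, ys), len(xs)
```

In this sketch a single faulting element causes all four elements to be emulated, which is the delay the preceding paragraph describes.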
What is needed, therefore, are systems and methods that allow more precise exception signaling so that an exception handler need only handle the data associated with a valid exception while allowing the data inputs that are not the cause of an exception to be timely processed by one or more processing elements. According to embodiments of the invention, a method of performing one or more operations on a plurality of elements using a multiple data processing element processor is provided. An input vector comprising a plurality of elements is received by a processor. The processor determines if performing a first operation on a first element will cause an exception and if so, writes an indication of the exception caused by the first operation to a first portion of an output vector stored in an output register. A second operation can be performed on a second element with the result of that second operation being written to a second portion of the output vector stored in the output register.
Embodiments of the invention include a system having a multiple data processing element processor. The system includes an input register, an output register, and the multiple data processing element processor. The input register can be configured to store an input vector comprising a plurality of elements. The output register can be configured to store the results of a plurality of operations. The processor is configured to receive the input vector from the input register, determine that performing a first operation on a first element will cause an exception, and output an indication of the exception caused by the first operation to a first portion of an output vector stored in the output register. Additionally, the processor can be configured to perform a second operation on a second element and output the result of the second operation to a second portion of the output vector stored in the output register.
Some embodiments of the invention include a method of performing an operation on a plurality of elements using a multiple data processing element processor. The method includes receiving an input vector that includes a first and a second element and determining that performing a first operation on the first element will cause an exception. In this case, the method continues by writing an indication of the exception caused by the first operation to a first portion of an output vector stored in an output register. Further, the method includes performing a second operation on the second element and writing a result of the second operation to a second portion of the output vector stored in the output register.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
FIGS. 2a and 2b depict multiple data operations according to various embodiments of the invention.
Features and advantages of the invention will become more apparent from the detailed description of embodiments of the invention set forth below when taken in conjunction with the drawings in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
The following detailed description of embodiments of the invention refers to the accompanying drawings that illustrate exemplary embodiments. Embodiments described herein relate to a low power multiprocessor. Other embodiments are possible, and modifications can be made to the embodiments within the spirit and scope of this description. Therefore, the detailed description is not meant to limit the embodiments described below.
It should be apparent to one of skill in the relevant art that the embodiments described below can be implemented in many different embodiments of software, hardware, firmware, and/or the entities illustrated in the figures. Any actual software code with the specialized control of hardware to implement embodiments is not limiting of this description. Thus, the operational behavior of embodiments will be described with the understanding that modifications and variations of the embodiments are possible, given the level of detail presented herein.
Inputs 102a and 102b may each comprise one or more registers capable of storing one or more input vectors. Additionally, according to some embodiments, the processor can be provided with a single input vector 102 stored on a single register. The input vectors can each include a number of data elements for processing by the processor. For instance, the processor 104 may perform an operation on a set of one or more elements to produce a result. As an example, assume input 102 contains elements x and y. Processor 104 may be configured to perform operation f on elements x and y and produce a result z such that z=f(x,y). Processor 104, however, can be configured to perform an operation on any number of elements from input 102.
Processor 104 may comprise a multiple data processing element processor such as a single instruction multiple data (SIMD) processor according to some embodiments. Additionally, the processor 104 may comprise a multiple instruction multiple data (MIMD) processor. The processor can be configured to perform a number of different operations (e.g., add, subtract, divide, multiply, shift, etc.) based on the instruction input 108. The processor can also be configured to output the result of the operation to the output register 106.
Processor 104 may be configured to receive a control signal 110 that controls whether the processor operates in a non-signaling exception mode according to various embodiments. When the processor is not operating in a non-signaling exception mode, processor 104 can be thought of as operating in a “normal” mode. That is, when an exception is generated by operation on any of the elements, the processor signals the exception and an exception handler handles the operation for all the elements. However, when processor 104 is operating in non-signaling exception mode, the processor does not signal that an exception has occurred and, instead, indicates an exception in the output register only for the specific operations that caused the exception, while allowing operation on the other elements to proceed and the results to be written to the output register.
FIG. 2a illustrates an operation performed by processor 104. For instance, as depicted, processor 104 receives a first input vector 202 comprising elements A0, A1, A2, and A3. The vector may be of any length and may be stored in a register. As an example, if first input vector 202 is stored in a 64-bit register, then each of elements A0, A1, A2, and A3 may comprise 16 bits. Similarly to first input vector 202, second input vector 206 may also comprise a number of elements B0, B1, B2, and B3. Additionally, the second input vector 206 may be stored in a register of any length and need not be the same length as the register that stores first input vector 202.
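As a non-limiting sketch, the 64-bit register holding four 16-bit elements can be modeled in Python as follows. The helper names `pack` and `unpack` are hypothetical, and placing element A0 in the least significant bits is an assumption; the description does not specify an element ordering.

```python
LANE_BITS = 16                       # width of each element, per the example above
LANES = 4                            # A0..A3
LANE_MASK = (1 << LANE_BITS) - 1


def pack(elements):
    """Pack four 16-bit elements into one 64-bit word.

    Assumption: element 0 occupies the least significant 16 bits.
    """
    if len(elements) != LANES:
        raise ValueError("expected exactly four elements")
    word = 0
    for i, e in enumerate(elements):
        word |= (e & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit elements."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]
```

For example, `pack([1, 2, 3, 4])` yields `0x0004000300020001` under the stated ordering assumption.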
According to embodiments of the invention, processor 104 can be configured to perform operations 204 on the elements in input vectors 202 and 206. Operations 204 can be defined by input instruction 108. In some embodiments (e.g., in embodiments where processor 104 is a SIMD processor), there will be only one instruction and the same operation will be performed on each of the input element pairs. This situation is depicted in
As with input vectors 202 and 206, result vector 208 may be stored in a register such as output register 106. While the output register may be of any size, it is preferably large enough to prevent overflow under any or most circumstances. For instance, the output register may be larger than the register storing either of input vectors 202 and 206 according to aspects of the invention.
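The reason a larger output register helps can be illustrated with a short Python sketch. Multiplication is chosen here only as an example of an operation whose result can exceed the input width; the function names and the 16-bit element width are assumptions made for illustration.

```python
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def multiply_same_width(xs, ys):
    # Each result element is as narrow as the inputs, so high-order bits are lost.
    return [(a * b) & LANE_MASK for a, b in zip(xs, ys)]


def multiply_widening(xs, ys):
    # Each result element is twice as wide as the inputs; the product of two
    # unsigned 16-bit values always fits in 32 bits.
    wide_mask = (1 << (2 * LANE_BITS)) - 1
    return [(a * b) & wide_mask for a, b in zip(xs, ys)]
```

With inputs `0xFFFF` and `0xFFFF`, the same-width result is truncated to `0x0001`, while the widened result retains the full product `0xFFFE0001`.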
FIG. 2b illustrates a situation similar to that depicted by FIG. 2a.
At step 304, the processor determines that performing an operation on a first element or first set of elements will cause an exception. An indication that performing the operation on the first element or set of elements will cause an exception is output to a corresponding position in an output register at step 306. The operation on the second element can be performed at step 308 and the result of the operation on the second element stored in a corresponding location of an output register at step 310. According to some embodiments, steps 304 and 306 may be performed in parallel with steps 308 and 310.
At step 404, the processor determines whether a non-signaling exception mode has been enabled or not. The mode can be enabled or disabled by setting or unsetting a control bit in the processor according to various embodiments. If the mode is disabled, then the processor performs the operation or operations on the elements according to a normal exception signaling method at step 418. That is, when an exception occurs, the processor signals an exception and allows an exception handler to perform the operation or operations on all of the input elements regardless of which element or set of elements caused the exception.
If it is determined that the non-signaling mode is enabled at step 404, then the processor determines whether an element or set of elements will generate an exception at step 406. If the element or set of elements will generate an exception, then the processor generates an indication of the exception at step 408 and outputs the indication to an output register at step 410. According to embodiments, the indication can identify the elements and the operation that caused the exception. If it is determined that the element or set of elements will not cause an exception, then the operation is performed at step 412 and the result of the operation on the element or elements is output to the output register at step 414. At step 416, the method loops back to step 406 if there are more elements to consider; otherwise, it ends at step 420.
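The flow of steps 404 through 420 can be sketched in Python as follows. This is an illustrative software model only, not the claimed hardware: the names `ExceptionIndication`, `SignaledException`, `will_cause_exception`, and `execute` are hypothetical, the form of the indication is assumed, and a zero divisor is assumed to be the only exceptional input. The elements are visited sequentially here for clarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExceptionIndication:
    """Hypothetical marker written to an output position in place of a result."""
    lane: int
    operation: str


class SignaledException(Exception):
    """Models the processor signaling an exception in the normal mode."""


def will_cause_exception(a, b):
    # Assumption: a zero divisor is the only exceptional input in this sketch.
    return b == 0


def execute(xs, ys, non_signaling):
    if not non_signaling:                              # step 404 -> step 418
        if any(will_cause_exception(a, b) for a, b in zip(xs, ys)):
            # Normal mode: signal, and leave all elements to the handler.
            raise SignaledException("handler must process all elements")
        return [a // b for a, b in zip(xs, ys)]
    output = [None] * len(xs)                          # models the output register
    for lane, (a, b) in enumerate(zip(xs, ys)):        # step 416 loop
        if will_cause_exception(a, b):                 # step 406
            output[lane] = ExceptionIndication(lane, "div")   # steps 408 and 410
        else:
            output[lane] = a // b                      # steps 412 and 414
    return output                                      # step 420
```

In the non-signaling mode, only the position corresponding to the faulting element holds an indication; the remaining positions hold ordinary results, so a later handler need only revisit the marked positions.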
It will be appreciated that various embodiments may be implemented or facilitated by or in cooperation with hardware components enabling the functionality of the various software routines, modules, elements, or instructions. Example hardware components are described further with respect to
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the spirit and scope of the invention. Furthermore, it should be appreciated that the detailed description of the present invention provided herein, and not the summary and abstract sections, is intended to be used to interpret the claims. The summary and abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventors.
For example, in addition to implementations using hardware (e.g., within or coupled to a Central Processing Unit (“CPU”), microprocessor, microcontroller, digital signal processor, processor core, System on Chip (“SOC”), or any other programmable or electronic device), implementations may also be embodied in software (e.g., computer readable code, program code, instructions and/or data disposed in any form, such as source, object or machine language) disposed, for example, in a computer usable (e.g., readable) medium configured to store the software. Such software can enable, for example, the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++), GDSII databases, hardware description languages (HDL) including Verilog HDL, VHDL, SystemC Register Transfer Level (RTL) and so on, or other available programs, databases, and/or circuit (i.e., schematic) capture tools. Embodiments can be disposed in any known non-transitory computer usable medium including semiconductor, magnetic disk, optical disk (e.g., CD-ROM, DVD-ROM, etc.).
It is understood that the apparatus and method embodiments described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in HDL) and transformed to hardware in the production of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents. It will be appreciated that embodiments using a combination of hardware and software may be implemented or facilitated by or in cooperation with hardware components enabling the functionality of the various software routines, modules, elements, or instructions, e.g., the components noted above with respect to
Execution unit 602 preferably implements a load-store (RISC) architecture with single-cycle arithmetic logic unit operations (e.g., logical, shift, add, subtract, etc.). Execution unit 602 interfaces with fetch unit 604, floating point unit 606, load/store unit 608, multiply/divide unit 620, co-processor 622, general purpose registers 624, and core extend unit 634.
Fetch unit 604 is responsible for providing instructions to execution unit 602. In one embodiment, fetch unit 604 includes control logic for instruction cache 612, a recoder for recoding compressed format instructions, dynamic branch prediction and an instruction buffer to decouple operation of fetch unit 604 from execution unit 602. Fetch unit 604 interfaces with execution unit 602, memory management unit 610, instruction cache 612, and bus interface unit 616.
Floating point unit 606 interfaces with execution unit 602 and operates on non-integer data. Floating point unit 606 includes floating point registers 618. In one embodiment, floating point registers 618 may be external to floating point unit 606. Floating point registers 618 may be 32-bit or 64-bit registers used for floating point operations performed by floating point unit 606. Typical floating point operations are arithmetic, such as addition and multiplication, and may also include exponential or trigonometric calculations.
Load/store unit 608 is responsible for data loads and stores, and includes data cache control logic. Load/store unit 608 interfaces with data cache 614 and scratch pad 630 and/or a fill buffer (not shown). Load/store unit 608 also interfaces with memory management unit 610 and bus interface unit 616.
Memory management unit 610 translates virtual addresses to physical addresses for memory access. In one embodiment, memory management unit 610 includes a translation lookaside buffer (TLB) and may include a separate instruction TLB and a separate data TLB. Memory management unit 610 interfaces with fetch unit 604 and load/store unit 608.
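As a simplified illustration, a TLB lookup can be modeled in Python as a mapping from virtual page numbers to physical frame numbers. The class name `TLB`, its methods, and the 4 KiB page size are assumptions made for this sketch; the description does not specify a page size or a TLB organization.

```python
PAGE_BITS = 12                        # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_BITS) - 1


class TLB:
    """Hypothetical software model of a translation lookaside buffer."""

    def __init__(self):
        self.entries = {}             # virtual page number -> physical frame number

    def insert(self, vpn, pfn):
        self.entries[vpn] = pfn

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & OFFSET_MASK
        if vpn not in self.entries:
            # A real memory management unit would refill the TLB here.
            raise KeyError(f"TLB miss for virtual page {vpn:#x}")
        # The page offset passes through translation unchanged.
        return (self.entries[vpn] << PAGE_BITS) | offset
```

A separate instruction TLB and data TLB, as mentioned above, could be modeled as two instances of this class.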
Instruction cache 612 is an on-chip memory array organized as a multi-way set associative or direct associative cache such as, for example, a 2-way set associative cache, a 4-way set associative cache, an 8-way set associative cache, et cetera. Instruction cache 612 is preferably virtually indexed and physically tagged, thereby allowing virtual-to-physical address translations to occur in parallel with cache accesses. In one embodiment, the tags include a valid bit and optional parity bits in addition to physical address bits. Instruction cache 612 interfaces with fetch unit 604.
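The benefit of virtual indexing with physical tagging can be seen in a short Python sketch: the set index depends only on the virtual address, so the set can be selected while the translation is still in progress, and the physical address is needed only for the final tag comparison. The line size, number of sets, and function names below are assumptions chosen for illustration.

```python
LINE_BITS = 5                         # assumption: 32-byte cache lines
INDEX_BITS = 7                        # assumption: 128 sets


def cache_index(vaddr):
    # Uses virtual address bits only, so it does not wait for translation.
    return (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)


def cache_tag(paddr):
    # Uses the translated physical address; compared against the stored tag.
    return paddr >> (LINE_BITS + INDEX_BITS)


def is_hit(stored_tag, valid, paddr):
    # A hit requires the valid bit to be set and the physical tags to match.
    return valid and stored_tag == cache_tag(paddr)
```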
Data cache 614 is also an on-chip memory array. Data cache 614 is preferably virtually indexed and physically tagged. In one embodiment, the tags include a valid bit and optional parity bits in addition to physical address bits. Data cache 614 interfaces with load/store unit 608.
Bus interface unit 616 controls external interface signals for processor core 600. In an embodiment, bus interface unit 616 includes a collapsing write buffer used to merge write-through transactions and gather writes from uncached stores.
Multiply/divide unit 620 performs multiply and divide operations for processor core 600. In one embodiment, multiply/divide unit 620 preferably includes a pipelined multiplier, accumulation registers (accumulators) 626, and multiply and divide state machines, as well as all the control logic required to perform, for example, multiply, multiply-add, and divide functions.
Co-processor 622 performs various overhead functions for processor core 600. In one embodiment, co-processor 622 is responsible for virtual-to-physical address translations, implementing cache protocols, exception handling, operating mode selection, and enabling/disabling interrupt functions. Co-processor 622 interfaces with execution unit 602. Co-processor 622 includes state registers 628 and general memory 638. State registers 628 are generally used to hold variables used by co-processor 622. State registers 628 may also include registers for holding state information generally for processor core 600. For example, state registers 628 may include a status register. General memory 638 may be used to hold temporary values such as coefficients generated during computations. In one embodiment, general memory 638 is in the form of a register file.
General purpose registers 624 are typically 32-bit or 64-bit registers used for scalar integer operations and address calculations. In one embodiment, general purpose registers 624 are a part of execution unit 602. Optionally, one or more additional register file sets, such as shadow register file sets, can be included to minimize context switching overhead, for example, during interrupt and/or exception processing.
Scratch pad 630 is a memory that stores or supplies data to load/store unit 608. The one or more specific address regions of a scratch pad may be pre-configured or configured programmatically while processor core 600 is running. An address region is a continuous range of addresses that may be specified, for example, by a base address and a region size. When base address and region size are used, the base address specifies the start of the address region and the region size, for example, is added to the base address to specify the end of the address region. Typically, once an address region is specified for a scratch pad, all data corresponding to the specified address region are retrieved from the scratch pad.
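The base address and region size check can be sketched in Python as follows. The names `in_region` and `LoadStoreModel` are hypothetical, the dictionary standing in for the data cache path is an assumption, and treating the end address (base plus size) as exclusive is also an assumption, since the description does not say whether the end is inclusive.

```python
def in_region(addr, base, size):
    """True if addr lies in the region starting at base and ending at base + size.

    Assumption: the end address is exclusive.
    """
    return base <= addr < base + size


class LoadStoreModel:
    """Hypothetical model routing accesses to a scratch pad or to another path."""

    def __init__(self, base, size):
        self.base, self.size = base, size
        self.scratch = bytearray(size)     # scratch pad contents
        self.other = {}                    # stand-in for the data cache path

    def store(self, addr, value):
        if in_region(addr, self.base, self.size):
            self.scratch[addr - self.base] = value
        else:
            self.other[addr] = value

    def load(self, addr):
        if in_region(addr, self.base, self.size):
            return self.scratch[addr - self.base]
        return self.other.get(addr, 0)
```

Accesses that fall inside the configured region are served by the scratch pad; all others follow the ordinary path.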
User Defined Instruction (UDI) unit 634 allows processor core 600 to be tailored for specific applications. UDI 634 allows a user to define and add their own instructions that may operate on data stored, for example, in general purpose registers 624. UDI 634 allows users to add new capabilities while maintaining compatibility with industry standard architectures. UDI 634 includes UDI memory 636 that may be used to store user added instructions and variables generated during computation. In one embodiment, UDI memory 636 is in the form of a register file.
Embodiments described herein relate to a shared register pool. The summary and abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventors, and thus, are not intended to limit the present invention and the claims in any way.
The embodiments herein have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries may be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others may, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.