1. Field of the Invention
This invention relates generally to the field of processors and, more particularly, to processors having low power consumption, high performance, and low die area that can be flexibly and scalably employed in multimedia and communications applications.
2. Description of the Prior Art
With the growing popularity of consumer gadgets, such as cell or mobile phones, digital cameras, iPods and personal digital assistants (PDAs), many new standards for communication with these gadgets have been adopted by the industry at large. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security. However, an emerging problem is that the different standards dictating communications of and between different gadgets require tremendous development effort. One of the reasons for the foregoing problem is that no processor or sub-processor currently available in the marketplace is easily programmable for use by all digital devices while conforming to the various mandated standards. It is only a matter of time before this problem grows, as new trends in consumer electronics warrant even more standards adopted by the industry in the future.
One of the emerging, if not current, requirements of processors is low power consumption combined with the ability to execute code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to be under sub-hundreds of milliwatts for executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
To provide specific examples of current processor problems, the problems associated with RISCs, which are used in some consumer products, microprocessors, which are used in other consumer products, digital signal processors (DSPs), which are used in yet other consumer products, application specific integrated circuits (ASICs), which are used in still other consumer products, and some of the other well-known processors, each exhibiting a unique problem, are briefly described below. These problems, along with the advantages of using each, are outlined below in a “Cons” section discussing the disadvantages thereof and a “Pros” section discussing the benefits thereof.
RISC/Super Scalar Processors
RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
Pros:
Cons:
Very Long Instruction Word (VLIW) and DSPs
VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
Pros:
Cons:
Reconfigurable Computing
Several efforts in industry and academia over the last 10 years were focused towards making a flexible solution with ASIC like price, power and performance characteristics. Many have challenged existing and matured laws and design paradigms with little industry success. Most of the attempts have been in the direction of creating solutions based on coarser grain FPGA like architectures.
Pros:
Cons:
Array of Processors
Some recent approaches are focused on making reconfigurable systems better suited to process heterogeneous applications. Solutions in this direction connect multiple processors optimized for either one or a set of applications to create a processor array fabric.
Pros:
Cons:
In light of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor for allowing execution of one or more multimedia applications simultaneously.
Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor including at least one W-type sub-processor capable of processing W bits, or more, in parallel, W being an integer value, at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W, a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and at least one Galois Field (GF) MAC coupled to communicate with the W-type sub-processor and the N-type sub-processor, wherein the W-type sub-processor rearranges bytes in transit to or from memory to accommodate execution of applications, allowing for fast operations.
Table 1 shows an example of the latency and turnaround associated with various operations of the GF MAC 4039.
Table 2 shows a summary of special ALU arithmetic and logical operations.
Referring now to
Accordingly, the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14, digital camera device 16, digital recording or music device 18 and PDA device 20. The product 12 is capable of executing one or more of the functions of the devices 14-20 simultaneously yet utilizing less power.
The product 12 is typically battery-operated and therefore consumes little power even when executing several of the applications executed by the devices 14-20. It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security.
The interface circuit 26 shown coupled to the bus 30 and interface circuit 28, shown coupled to the bus 31, include the blocks 40-66, which are generally known to those of ordinary skill in the art and used by current processors.
The processor 22, which is a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has associated therewith an instruction memory. For example, the CoolW block 74 has associated therewith an instruction memory 82, the CoolW block 76 has associated therewith an instruction memory 84, the CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88. Similarly, each of the blocks 74-80 has associated therewith a control block. The block 74 has associated therewith a control block 90, the block 76 has associated therewith a control block 92, the block 78 has associated therewith a control block 94 and the block 80 has associated therewith a control block 96. The blocks 74 and 76 are designed to generally operate efficiently for 16, 24, 32 and 64-bit operations or applications, whereas the blocks 78 and 80 are designed to generally operate efficiently for 1, 4, or 8-bit operations or applications.
The blocks 74-80 are essentially sub-processors and the CoolW blocks 74 and 76 are wide (or W) type of blocks, whereas, the CoolN blocks 78 and 80 are narrow (or N) type of blocks. Wide and narrow refers to the relative number of parallel bits processed or routed within a sub-processor and that gives the heterogeneous characteristic of the processor 22. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74-80 resulting in the lowest latency path through the sub-processor to which it is coupled. In
It should be noted that while four blocks 74-80 are shown, other numbers of blocks may be utilized; however, utilizing additional blocks clearly results in additional die space and higher manufacturing costs.
Complicated applications requiring great processing power are not scattered in the circuit 20, rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced contributing to lower power consumption.
The circuit 20 is an example of a system on chip (or SoC) offering Quasi-Adiabatic Programmable sub-Processors for multimedia and communications applications. Two types of sub-processors are included, as previously indicated: W type and N type. The W type or Wide type processor is designed for high Power, Price, Performance efficiency in applications requiring 16, 24, 32 and 64 bits of processing. The N type or Narrow type processor is designed for high efficiency in applications requiring 8, 4 and 1 bit of processing. While these bit numbers are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed.
Different applications require different performance or processing capabilities and are thus executed by a different type of block or sub-processor. Take, for instance, applications that are typically executed by DSPs; they would generally be processed by W type sub-processors, such as the blocks 74 or 76 of
Other commonly occurring DSP kernels can be executed by N type sub-processors, such as blocks 78 and 80 and include, but are not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec, Cyclic Redundancy Check, Walsh Code Generator, Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader, Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator, and Puncturing/De-puncturing.
Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization in comparison with existing architectural approaches like RISC, Reconfigurable, Superscalar, VLIW and Multi-processor approaches. The sub-processor architecture of the processor 22 reduces die size resulting in an optimal processing solution and includes a novel architecture referred to as “Quasi-Adiabatic” or “COOL” architecture. Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors.
Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the processor 22, as presented below.
“Quasi-Adiabatic Programmable” or Concurrent Applications of heterOgeneous intercOnnect and functionaL units (COOL) Processors. In terms of thermodynamics, Adiabatic Processes do not waste heat and transfer all the used energy to performing useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make an Adiabatic Processor. However, among the different possible processor architectures, some may be closer to Adiabatic. The various embodiments of the present invention show a class of processor architectures which are significantly closer to Adiabatic as compared to the architectures of prior art, while they are, nevertheless, programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
The integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently and the number of such applications far exceeds that which is supported by current processors. Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include but are not limited to downloading an application from a wireless device while decoding a movie that has been received, thus, a movie can be downloaded and decoded simultaneously. Due to achieving simultaneous application execution on the integrated circuit 20, which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for multiple devices of
Each of the blocks 74-80 can execute only one sequence (or stream) of programs at a given time. A sequence of programs refers to a function associated with a particular application. For example, FFT is a type of sequence. However, different sequences may be dependent on one another. For example, an FFT program, once completed, may store its results in the memory 70 and the next sequence may then use the stored result. Different sequences sharing information in this manner or being dependent upon each other in this manner is referred to as “stream flow”.
In
The instruction memories 82, 84, 86 and 88 are used to store instructions for execution by the blocks 74-80, respectively.
Optionally, shared registers 326 and 328 cause communication directly between two types of sub-processors. For example, in
In
The blocks 406 and 408 perform the majority of actual computation on data. The Load/Store MFU block 402 computes addresses for accesses made to/from the memory 312 and the memory 410. The Vector X MFU block 404 rearranges vector data on its way between the memory 312 and the block 408. The Vector X MFU block 404 is also used to generate vector store masks for vector stores to the memory 312. The block 406 only operates on one piece of data at a given time, whereas, the blocks 404 and 408 operate on data in the form of vector. The block 402 provides addresses for memory accesses. Some computation is performed by the block 402 but it is in the nature of overhead computations.
A machine instruction encodes (as needed) separate operations for the various MFU blocks in addition to operations to move data between MFU blocks. All operations in a single instruction are executed in parallel. The Vector X MFU block 404 causes rearranging of vector data and generation of vector store masks under the control of separately encoded operations in instructions. The local memory 410 is used for storing information locally to avoid having to access information externally to the block 74 for every instruction. The bus 412 is coupled to the memory 312 through which memory addresses are provided.
The block 402 is shown coupled to the block 44 through a bus 424, the block 402 is further shown coupled to the block 406 through a bus 426, the block 402 is further shown coupled to the block 410 through the bus 428. The blocks 404, 408 and 410 are shown coupled to each other through a vector bus 420 and the blocks 406, 404, 408 and 410 are shown coupled to each other through a scalar bus 422. A bus is generally a group of wires, each wire coupling a signal wherein the wires are parallel to each other and thus, capable of coupling signals in parallel. The number of wires within a bus defines the number of binary bits, which serves as a characteristic of the bus. In
The block 404 also provides vector store mask, which is coupled onto the bus 416.
Memory data is coupled onto the block 406 for computation operations, from the block 402, but vector data is first provided to the block 404 . . . . It is significant to note that the block 404 offers the ability to organize data in memory to match that which is needed in the computation unit, i.e. the block 408, thereby greatly increasing performance.
The block 502 is coupled to other blocks of the block 402, as shown in
The address registers of the block 402 and circular buffer registers of the block 404 provide inputs to the address generators of the blocks 506 and 508. In the case of the address registers of the block 402, those inputs are previously stored addresses, while for the circular buffer registers of the block 404, those inputs are information about circular buffers.
The blocks 506 and 508 serve to modify addresses. Namely, the block 506 serves to modify the addresses generated by the block 504 or address received from the block 406 or even the addresses generated from the block 502, while the block 508 serves to modify addresses received from the block 502 and/or the block 406 and even the block 504. The output of the block 506 is then provided as input to the mux 512, which also receives, as input, the addresses generated by the block 502. The mux 512 then selects one of its inputs and couples the same onto the bus 520 for reception by other blocks of the block 74, as shown in
Thus, the Load/Store MFU can generate two addresses in parallel. An address is computed by combining an address register and either a constant or a value from the Scalar ALU MFU. A computed address can optionally be wrapped around within the bounds of a circular buffer. Computed addresses are primarily intended for use in accessing memories, but may also be assigned to address registers or circular buffer registers, or used as inputs to other MFUs.
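The address computation just described can be sketched in software as follows. This is only an illustrative model, assuming a circular buffer is described by a base address and a length; the function names, parameter names and the base/length representation are hypothetical and are not taken from the embodiments.

```python
def generate_address(addr_reg, offset, circ_base=None, circ_length=None):
    """Return addr_reg + offset, optionally wrapped inside a circular buffer.

    All names are hypothetical; the offset stands for either a constant or a
    value supplied by the Scalar ALU MFU.
    """
    addr = addr_reg + offset
    if circ_base is not None and circ_length is not None:
        # Wrap so that circ_base <= addr < circ_base + circ_length.
        addr = circ_base + (addr - circ_base) % circ_length
    return addr


def generate_address_pair(request_a, request_b):
    """Model two address generators each producing one address for an instruction."""
    return generate_address(*request_a), generate_address(*request_b)


# One plain address and one address wrapped in a 16-byte buffer at 0x100.
pair = generate_address_pair((100, 8), (0x10E, 4, 0x100, 0x10))
```

In this sketch the second request steps past the end of the buffer and wraps back to its start, which is the behavior the text attributes to a computed address bounded by a circular buffer.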
The structures of
The registers and feedback paths (coupling) of
The mux 732, receives as input, outputs generated by the block 718 and 720 and the mux 730 receives inputs generated by the blocks 704 and 706 and further generates an output that is received by the block 702. The output of the blocks 708 and 722 are provided to the block 406. N, as used herein is an integer value, for example, N ALUs is an N number of ALU circuits.
The blocks 702-714 and the mux 730 generally perform a multiply accumulate (MAC) function, whereas, the blocks 716-726 and the mux 732 perform an ALU function, however, the number of bits, in parallel, on which such MAC and ALU functions are performed is generally N times greater than the number of bits processed by the block 406. The blocks 704 and 712 are segmentable, that is, they are capable of selectably segmenting add operation. For example, in the case where N 32-bits are being processed, in parallel, in addition to being able to perform N 32-bit add operations, each ALU block is capable of performing 2N 16-bit add operations, or 4N 8-bit add operations. The block 714 functions in the same manner as that of block 1110 of
The block 706 shifts a vector value, i.e. an N M-bit value, to the right or left by an integer value. An example of a vector shift would be to take a vector such as
in this case eight values, and return the vector
or perhaps
These operations would not usually be interpreted as any sort of multiplication or division. The block 708 allows choosing a single element of a vector value, for example, a particular byte (eight bits) can be selected out of the vector value.
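A vector shift and an element selection of the kind described above might be modeled as follows. The eight sample values, the zero fill, and the rotating variant are assumptions chosen purely for illustration; they are not the values or fill behavior of the embodiments.

```python
def vector_shift(vec, count, fill=0):
    """Shift elements by whole positions; vacated positions receive `fill`.

    A positive count moves elements toward higher indices, a negative count
    toward lower indices. The direction convention is an assumption.
    """
    n = len(vec)
    if count >= 0:
        count = min(count, n)
        return [fill] * count + list(vec[:n - count])
    count = min(-count, n)
    return list(vec[count:]) + [fill] * count


def vector_rotate(vec, count):
    """Variant in which elements shifted out re-enter at the other end."""
    k = count % len(vec)
    return list(vec[-k:]) + list(vec[:-k]) if k else list(vec)


def select_element(vec, index):
    """Choose a single element (for example one byte) out of a vector value."""
    return vec[index]


v = [1, 2, 3, 4, 5, 6, 7, 8]       # hypothetical eight-element vector
shifted = vector_shift(v, 2)       # [0, 0, 1, 2, 3, 4, 5, 6]
rotated = vector_rotate(v, 2)      # [7, 8, 1, 2, 3, 4, 5, 6]
```

Note that the elements move as units; no element value is scaled, which is why such a shift is not a multiplication or division.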
The block 720 functions in a similar manner as the block 706 and the block 726 functions in a similar manner as the block 710. The output of the blocks 712 and 726 are selectively provided to the block 702, through the mux 704 and the output of the blocks 706 and 704 are selectively provided to the block 702, through the mux 730. Furthermore, the outputs of the blocks 720 and 718 are selectively provided to the block 716 through the mux 732.
The block 722 performs an addition operation on a vector basis, whereas, the other blocks of the block 408 operate on an element basis. That is, the block 722 adds all of the elements of a single vector together and the blocks that operate on an element basis perform an operation on one or more of a selected and corresponding element(s) of different vectors.
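The distinction between element-basis and vector-basis addition, together with the segmentable add described earlier for the vector computation blocks, can be illustrated with the following sketch. It assumes 32-bit words split into equal lanes with no carry between lanes; the function names are hypothetical.

```python
def segmented_add(a, b, lane_bits, total_bits=32):
    """Add two words as independent lanes of lane_bits each.

    Carries are not propagated across lane boundaries, so one 32-bit adder
    behaves as one 32-bit, two 16-bit or four 8-bit adders.
    """
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane & mask) << shift
    return result


def elementwise_add(vec_a, vec_b, lane_bits=32):
    """Element basis: add corresponding elements of two vectors."""
    return [segmented_add(x, y, lane_bits) for x, y in zip(vec_a, vec_b)]


def reduce_add(vec):
    """Vector basis: add all elements of a single vector together."""
    return sum(vec)


as_8bit = segmented_add(0x00FF00FF, 0x00010001, 8)    # each 0xFF lane wraps to 0x00
as_16bit = segmented_add(0x00FF00FF, 0x00010001, 16)  # carry stays inside each 16-bit lane
```

The same operands give different results depending on the chosen lane width, which is the effect of selectably segmenting the add operation.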
The blocks 710 and 726 each allow a conversion from N or 2N, selectively. Further shown in
The block 802 is shown to receive input from other blocks of
In one example, the block 404 has a register file, the block 808, of N*32-bit vector registers, for the same N as the block 408. The block 806 of the block 404 includes mask registers of size N*4 bits. Each bit of a mask register corresponds with one byte of a vector register. When an N*32-bit vector is stored to external shared memory, an N*4-bit mask can be supplied to indicate which bytes of the vector are actually to be written to memory. (Memory bytes corresponding to a zero bit in the mask are left unchanged.) A mask generator function computes a 4*N-bit mask based on the setting of a mask control register.
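The masked vector store can be sketched as follows, taking N = 2 so that a vector is 8 bytes and a mask is 8 bits. The mask generator shown here, which sets a contiguous run of bits, is a hypothetical stand-in; the text does not specify how the mask control register is interpreted.

```python
def make_mask(start, length, total_bytes):
    """Hypothetical mask generator: `length` one-bits beginning at byte `start`."""
    return [1 if start <= i < start + length else 0 for i in range(total_bytes)]


def masked_store(memory, address, vector_bytes, mask):
    """Write only the bytes whose mask bit is one; other memory bytes are unchanged."""
    for i, (byte, bit) in enumerate(zip(vector_bytes, mask)):
        if bit:
            memory[address + i] = byte


memory = bytearray([0xEE] * 16)      # 0xEE marks bytes that were never written
vector = bytes(range(1, 9))          # N = 2, so N*32 bits = 8 bytes
mask = make_mask(2, 3, len(vector))  # [0, 0, 1, 1, 1, 0, 0, 0]
masked_store(memory, 4, vector, mask)
```

After the store, only memory bytes 6, 7 and 8 hold vector data; every byte whose mask bit is zero keeps its previous contents.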
The block 404 can permute the 8*N bytes of two vector registers to choose 4*N bytes. In the general case, the specific permutation is controlled by the value of a third vector register. Certain “precoded” permutations do not require the use of a control vector; these include all funnel shifts left and right of the two input vector registers. At the same time that the 8*N bytes of two vector registers are permuted, the 8*N bits of two mask registers can be identically permuted to maintain the same bit-for-byte correspondence between mask and vector values.
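The byte permutation and the matching mask permutation can be modeled as below with N = 1 (4-byte vector registers). The encoding of the control vector as a list of source byte indices, and the direction of the funnel shift, are assumptions made for this sketch.

```python
def permute(reg_a, reg_b, control):
    """Choose len(control) items from the concatenation of two registers.

    control[i] is the index (0 .. 8*N-1) of the source item placed at
    position i. The same routine serves bytes and mask bits.
    """
    source = list(reg_a) + list(reg_b)
    return [source[index] for index in control]


def funnel_shift_control(shift, width):
    """A 'precoded' permutation: `width` consecutive items starting at `shift`."""
    return list(range(shift, shift + width))


vec_a, vec_b = [0xA0, 0xA1, 0xA2, 0xA3], [0xB0, 0xB1, 0xB2, 0xB3]
mask_a, mask_b = [1, 1, 0, 0], [0, 1, 1, 1]

control = funnel_shift_control(1, 4)          # [1, 2, 3, 4]
out_vec = permute(vec_a, vec_b, control)      # bytes 1..3 of A, then byte 0 of B
out_mask = permute(mask_a, mask_b, control)   # mask bits follow their bytes
```

Applying one control vector to both the byte data and the mask bits preserves the bit-for-byte correspondence between mask and vector values.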
The blocks of
The blocks 802, 804, 806 and 810 of
SIMD is an acronym for Single Instruction, Multiple Data, and MIMD is an acronym for Multiple Instruction, Multiple Data. These are standard terms in computer architecture and programming known to those skilled in the art.
The block 1102 is shown coupled to the memory 312 and other blocks of
The block 1112 is shown coupled to the block 1116, which generates an output provided as input to the block 1112. The block 1118 is shown coupled to the block 1112 and blocks 1106 and 1110 are shown coupled to the block 1112.
The blocks 1102, 1104, 1106, 1108 and 1110 and the mux 1122 cause an ALU function to be performed while the blocks 1112-1118 and the mux 1124 cause a multiply-accumulate (MAC) function to be performed.
The blocks 1104 and 1108 are ALUs and perform such functions and their output is selectively, through the muxes 1122 and 1120, provided as input (or feedback) to the block 1102. In every clock cycle, two ALU operations may be performed. The block 1110 performs a multiply function and produces an output that is provided to the block 1112, which is capable of processing a higher number of bits, in parallel, than that of the block 1102. For example, in the case where the block 1102 has a 32-bit capability, the block 1112 has a 40-bit capability. The block 1112 serves as an accumulator register, i.e. adding inputs accumulatively.
The block 1106 converts an N-bit value to an (N+X)-bit value, where X is an integer value. For example, a 32-bit value can be converted to a 40-bit value. The block 1114 shifts a value by a predetermined number of bits and passes the result to the block 1102, through the mux 1122.
The block 1118 converts from a higher number of bits to a lower number of bits, such as 40 bits to 32 bits. The block is coupled to the block 408. The block 406 can execute two ALU operations in parallel on values from the block 1102. In place of the first ALU operation, an N-bit shift operation may be performed, or a conversion of an N-bit value to an X-bit value to be stored in the block 1112. In place of the second ALU operation, a multiplication may be performed by the block 1110 and the result stored in one of the registers of the block 1112.
The block 406 can, in parallel, perform a 40-bit shift, a 40-bit add/subtract, and a conversion of a 40-bit value to a 32-bit one to be stored in one of the Scalar ALU MFU's 32-bit registers.
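The interplay of 32-bit registers and a 40-bit accumulator can be illustrated with the following sketch. It assumes two's-complement arithmetic and offers saturation as one possible 40-bit to 32-bit conversion; neither the number format nor the conversion rule is stated in the text, and all names are hypothetical.

```python
ACC_BITS = 40  # accumulator width from the example in the text
REG_BITS = 32  # scalar register width from the example in the text


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def multiply_accumulate(acc, a, b):
    """Add the product of two 32-bit operands to a 40-bit accumulator."""
    product = to_signed(a, REG_BITS) * to_signed(b, REG_BITS)
    return to_signed(acc + product, ACC_BITS)


def narrow_to_32(acc, saturate=True):
    """Convert a 40-bit value to 32 bits; saturation is an assumed option."""
    if saturate:
        low, high = -(1 << (REG_BITS - 1)), (1 << (REG_BITS - 1)) - 1
        return max(low, min(high, acc))
    return to_signed(acc, REG_BITS)


acc = 0
for _ in range(3):
    acc = multiply_accumulate(acc, 30000, 30000)  # 3 * 900,000,000
```

The running sum of 2,700,000,000 exceeds the signed 32-bit range but fits in 40 bits, which shows why the accumulator is wider than the registers that feed it.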
Further details of one of the N-type sub-processors, such as the block 78 will now be discussed with reference to figures to follow. It should be noted that the blocks 406 and 404 of
The block 1306 is shown further coupled to the macro function blocks 1340, which is, in turn, shown coupled to the block 1302 through a macro function bus 1310. The block 1302 is shown to include a store buffer 1314, a store buffer 1312 and a bus interconnect block 1308. The block 1302 generates an output provided to memory, such as the memory 312 and therefore coupled accordingly through the block 1314. The block 1304 is shown to receive input or be coupled to memory, such as the memory 312. The block 1306 is shown to include a load buffer 1320, a load buffer 1318 and a bus interconnect block 1316, which is coupled to the blocks 1340.
The blocks 1340 are shown to include a Galois field MAC block 1322, a special ALU block 1324, a combiner block 1326, a memory 1328, a puncturing/depuncturing block 1330, an interleaver block 1332 and a Viterbi block 1334, which are each shown coupled to the bus 1310. The blocks 1322-1332 are each shown to receive input from or be coupled to the block 1316. The block 1334 receives input from the block 1332 and is coupled to receive and generate data thereto.
The flow of data is such that data or information flows in from and through the block 1306 to the blocks 1340 and then to the block 1302 and out onto memory. In this manner, a pipeline effect is introduced wherein multiple operations overlap and are processed concurrently, in a pipeline fashion. For example, information may be loaded by the block 1306 while information is being stored into memory by the block 1302. Data is stored in the blocks 1320 and 1328 of the block 1306 after being received by the block 1304 from memory and subsequently provided to and processed by the blocks 1340, the details of which will be discussed shortly with respect to subsequent figures.
Upon completion of processing by the blocks 1340, the processed data is provided to the block 1302, through the bus 1310, and stored in the blocks 1312 and 1314 wherein they are stored until coupled to be received by memory. The buffers of the blocks 1314, 1312, 1318 and 1320 are of a predetermined width or number of bits, in parallel. In one example, each of these buffers is 256 bits wide, however, other number of bits may be employed.
A value or data that may have been processed by the blocks 1340 may be moved from the block 1302 to the block 1306 for re-use. Furthermore, data may be received by the block 1304 from memory and then moved to the block 1306 for processing thereof. Further details of each of the blocks 1340 are now presented. The blocks 1314 and 1312 cause a double buffering effect, which assists in reducing “stalling” commonly experienced in pipelining operations, as do blocks 1318 and 1320. Stalling results from access of blocks 1302 and 1306 simultaneously by memory. In another embodiment, the blocks 1314 and 1312 may be one block and the blocks 1318 and 1320 may be one block.
A latency may be associated with an operation or a pipeline effect may be present. The latency may result from each of the blocks within the blocks 1340.
The output of the block 1406 is shown coupled to the circuit 1404 as another input thereof. The output of the block 1404 is provided to the block 1406, such coupling effectuates the MAC part of the Galois field MAC operation. The block 1404 effectively performs an XOR multiply operation typically used in Galois field MAC operations.
The block 1402 is shown to include a register block 1420 and a register block 1422, which are shown coupled to an Xor tree block 1424. The block 1420 is further shown to include a register block 1426, a Galois field multiply iteration 11428, a register block 1430, a Galois field multiply iteration 11432, a register block 1434 and a register block 1436. While not shown in
The block 1424 is shown coupled to the block 1426, which is, in turn, shown coupled to the block 1428, which is, in turn, shown coupled to the block 1430, which is, in turn, shown coupled to the block 1432, which is, in turn, shown coupled to block 1434, which is coupled to either the block 1436 or one or more register blocks intermediately located between the blocks 1434 and 1436.
In
In operation, the block 1322 operates on an N-bit value or data, such as an 8-bit value, and based on the same generates an N-bit value or data by shifting the original value eight ways based on another N-bit value. The N-bit values are then XORed by the block 1404 until the result is reduced to N bits with a reduction constant and optionally added to the contents of an N-bit accumulator register, such as a value in the block 1406. A “Clear” operation may also be performed by the block 1406. Examples of applications employing Galois field MAC operations and therefore block 1322 include but are not limited to cyclic redundancy code (CRC) operations, convolutional encoder operations, scramble code generator operations and others.
The blocks 1508 and 1506 are shown to generate inputs to a conditional register block 1512 and further shown coupled to generate inputs to the add/sub/Abs/diff/conditional add-sub/multiply (AGU) block 1510, which, in turn, generates input to the output register block 1514. The block 1514 is shown coupled to a mux 1516, which is, in turn, shown coupled to an adder 1518. The adder 1518 is shown coupled to an accumulator-register block 1520, the output of which is shown to serve as another input of the adder 1518. Another output of the block 1520 is shown to serve as input to a mux 1522, which receives, as another input, an output of the block 1514. The mux 1522 generates an output 1530 which is coupled to the bus 1310. Some of the inputs to the muxes 1504 and 1502 are received from the block 1316.
Each of the muxes 1504 and 1502 is shown to receive four inputs. One of the inputs of the mux 1504, dp, is received from the block 1306, as is the input, dp, of the mux 1502. Another input of the mux 1504 comes from a series of the lowest-order bits of an output of the block 1514, as does one of the inputs of the mux 1502. Another input of the mux 1504 comes from the highest-order bits of the same output of the block 1514. Yet another input of the mux 1504 is a value ‘0’. One of the inputs of the mux 1502 is the value ‘1’ and another one of its inputs is the value ‘−1’. The values ‘0’, ‘1’ and ‘−1’ are provided in an effort to expedite the operations performed by the block 1324 in that it has been experienced that these values are repetitively utilized in various operations and therefore their presence increases system performance. It should be noted that there might be a plurality of the blocks 1510 utilized for increased performance. The block 1324 is organized as shown in
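A Galois field multiply-accumulate of this kind can be sketched for N = 8 as follows. The reduction constant 0x11B (the polynomial used by AES) is an assumption chosen only because its results are easy to check; the text does not name a particular constant, and the function names are hypothetical.

```python
def gf_multiply(a, b, reduction=0x11B, bits=8):
    """Multiply two elements of GF(2^bits).

    Step 1: XOR together copies of `a` shifted by each set bit position of `b`.
    Step 2: XOR in shifted copies of the reduction constant until the result
    fits in `bits` bits.
    """
    product = 0
    for i in range(bits):
        if (b >> i) & 1:
            product ^= a << i
    for i in range(2 * bits - 2, bits - 1, -1):
        if (product >> i) & 1:
            product ^= reduction << (i - bits)
    return product


def gf_mac(acc, a, b, reduction=0x11B, bits=8):
    """Accumulate a product; addition in a binary Galois field is XOR."""
    return acc ^ gf_multiply(a, b, reduction, bits)


def gf_clear():
    """Model of the 'Clear' operation on the accumulator register."""
    return 0


acc = gf_clear()
acc = gf_mac(acc, 0x57, 0x83)  # 0xC1 under the assumed reduction constant
```

Because both the partial products and the accumulation are XOR operations, no carries propagate, which is what allows the hardware to use an XOR tree rather than a conventional adder.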
In operation, the blocks 1510 and 1512 operate on the A and B values provided by the blocks 1508 and 1506, respectively. Two other inputs to the mux 1516 are generated by a reduction operation block within the block 1520 (not shown in
The block 1512 is a 2N wide register that allows conditional add or conditional subtract operations to be performed by the block 1510 for use in despreading operations. The block 1512 essentially modifies the A and B values for use by the block 1510.
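How a conditional add-subtract serves a despreading operation can be illustrated with the sketch below. The convention that a condition bit of one selects subtraction, and the sample and code values, are assumptions for this example only.

```python
def conditional_add_sub(acc, sample, condition_bit):
    """Subtract the sample when the condition bit is set, otherwise add it."""
    return acc - sample if condition_bit else acc + sample


def despread(samples, code_bits):
    """Correlate received samples against a spreading code.

    Each code bit plays the role of one bit of the conditional register and
    decides whether its sample is added or subtracted.
    """
    acc = 0
    for sample, bit in zip(samples, code_bits):
        acc = conditional_add_sub(acc, sample, bit)
    return acc


# A symbol of amplitude +3 spread by the hypothetical code 0, 1, 0, 0.
received = [3, -3, 3, 3]
symbol = despread(received, [0, 1, 0, 0])
```

When the code matches, every term contributes with the same sign and the accumulated value is large; a mismatched code leaves the terms partly cancelling.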
The mux 1522 allows essentially the output of the block 1510 upon having been stored by the block 1514 to be selectively provided to the block 1302, through the signal 1530, and as determined by a select signal provided as yet another input to the mux 1522. Otherwise, the result of the block 1510 undergoes an accumulation-add operation, the final result of which is stored in the block 1520, through the blocks 1518 and 1520 prior to being provided to the block 1302.
The block 1324 is an N-layer ALU including one or more ALUs that support the following operations:
The block 1510 is common to the W-type sub-processor wherein each block 1510 is capable of reading at least 128 bits and thus, the two blocks are capable of reading at least 256 bits of data every clock cycle when there is no contention in memory.
The result of each of the blocks 1602-1608 is made available to another block. For example, the result of the block 1602 serves as input to the block 1604, the result or output of the block 1604 serves as input to the last acc-reg block within the block 1608 and the result or output of the block 1606 serves as input to the block 1608. Because the results of the blocks are provided in a forward manner and simultaneously with the accumulation of the stages within a block, only seven cycles are required to perform a reduction operation when a four-stage acc-reg block is employed.
The block 1610 comprises a mux coupled to an accumulator. The mux is a 2:1 mux selecting one of two inputs to be provided to the accumulator. One of the two inputs of the mux of the block 1610 is provided by the output of the block 1514 and the other input is the result of the previous-stage acc-reg block. In this manner, the reduction function of
The block 1714 includes a plurality of registers, including the registers 1716 through 1746, that are used to create a combination of the outputs of the shifters 1702-1712. For example, the lower eight bits of the output of each of the shifters 1702-1712 can be made to go through a mux to selectively choose which of the lower eight bits are to be ultimately generated. Thus, each of the registers of the block 1714 can arbitrarily select among an “interesting position” of shifted bits. The interesting position is determined by the output of each of the shifters 1702-1712. The output of the block 1714 is provided to the bus 1310.
Thus, in one embodiment of the present invention, the block 1326 comprises four 20-bit and two 24-bit input registers. It includes eight 16-bit registers where random 32, 16, 8 and 4-bit combinations of bits from its input registers are created and stored. The block 1326 can be used in three modes: 1) using two specific 20-bit registers for output generation; 2) using four 20-bit registers for output generation; or 3) using all seven registers for output generation. The shifters 1702-1712 include input registers not shown due to the known structure and function of a shifter by those skilled in the art.
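As a rough illustration of the first mode, the following Python sketch shifts two 20-bit inputs so that the bits of interest land in the low eight positions and then fills each output bit from a chosen low-order bit. The function name `combine_mode1`, the `(register, bit)` selection list, and the use of right shifts are assumptions for illustration only.

```python
MASK20 = (1 << 20) - 1  # width of the 20-bit input registers


def combine_mode1(inputs, shifts, selection):
    """Build an output word from the low 8 bits of two shifted inputs.

    inputs:    two 20-bit values (hypothetical input registers)
    shifts:    right-shift amount applied to each input
    selection: one (register_index, bit_index) pair per output bit,
               listed from output bit 0 upward; bit_index must be 0..7
    """
    # Step 1: shift the bits of interest down to the least significant positions.
    low_bytes = [((value & MASK20) >> shift) & 0xFF
                 for value, shift in zip(inputs, shifts)]
    # Step 2: fill each output bit from the selected low-order bit.
    out = 0
    for position, (reg, bit) in enumerate(selection):
        if not 0 <= bit < 8:
            raise ValueError("only the 8 least significant bits are selectable")
        out |= ((low_bytes[reg] >> bit) & 1) << position
    return out


if __name__ == "__main__":
    selection = [(0, b) for b in range(8)] + [(1, b) for b in range(8)]
    print(hex(combine_mode1([0xABCDE, 0x12345], [4, 8], selection)))
```

Restricting the selectable bits to a small low-order window is what keeps the selection mux small; the shifters do the work of bringing arbitrary bits into that window.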
In order to reduce the hardware or number of blocks or circuits required to perform the combining function of the block 1326, each bit in the 32-bit output register can only be filled from the least significant 8 bits in the two 20-bit registers in the first mode, the 4 least significant bits in the four 20-bit registers in the second mode, and the 2 least significant bits from the four 20-bit registers and 4 least significant bits in the 24-bit registers in the third mode. Forming random combinations from the input registers is a two-step process, where the first step involves shifting the “interesting” bits to the least significant positions, from where random filling into the output register can be allowed in that mode. In the example used herein with respect to
The memory 1326 is a generic random access memory and will therefore not be discussed in further detail. Suffice it to say however, that the size of the memory is based upon the applications for which the N-type sub-processor is to be used.
The inputs to the circuits of the block 1330 are generated by the block 1332, which will be discussed shortly but, for now, provides either fully interleaved, partially interleaved or un-interleaved N-bit words to the block 1330. In one example the operation is on 256-bit words, in which case the block 1330 operates on 16 bits at a given time. A prefetched control word is used to decide which bits within the 16-bit word must be inverted. Optionally, a ‘0’ or a ‘1’ value is entered into specific bit positions in addition to inversion.
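The inversion and optional bit forcing can be pictured with a small Python sketch. The function name `condition_word` and the split of the control information into `invert_mask`, `force_mask` and `force_value` are assumptions for illustration; the text above does not specify the control word layout.

```python
def condition_word(word, invert_mask, force_mask=0, force_value=0, width=16):
    """Invert selected bits of a word, then optionally force bits to 0 or 1.

    invert_mask: 1 bits mark positions to invert
    force_mask:  1 bits mark positions to overwrite (hypothetical encoding)
    force_value: replacement bits used at the forced positions
    """
    mask = (1 << width) - 1
    out = (word ^ invert_mask) & mask                        # selective inversion
    out = (out & ~force_mask) | (force_value & force_mask)   # forced 0/1 entry
    return out & mask


if __name__ == "__main__":
    print(bin(condition_word(0b1010, 0b0110)))
    print(hex(condition_word(0xFFFF, 0x00FF, force_mask=0x8001, force_value=0x0001)))
```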
The block 1334 is capable of executing turbo-decoder, SAD and despreading functions. In one example, 32 to 256 add-compare-select operations can be performed, in parallel, by the block 2004, on 16-bit branch and path metric values generated by the local memory 2006. In one example, the size of the local memory 2006 is between 1 kilobit and 16 kilobits.
There may be a plurality of blocks 2004 included in the block 1334, each of which may include 8-bit signed adders. Additionally, each can include a compare and a select block that returns the winning path and the decision bit. The add-compare-select operations result in a winning path and decision bits. The winning path can be shared with neighboring blocks 2004 using a “multi-cast” interconnect scheme for going down the trellis. Decision bits with the winning branch and path metric values are stored for backtracking.
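A single add-compare-select step can be sketched in Python as follows. The function name `acs`, the convention that the smaller metric wins, and the tie-breaking toward the first candidate are assumptions for illustration; metric widths and saturation are not modeled.

```python
def acs(path_metric_0, branch_metric_0, path_metric_1, branch_metric_1):
    """One add-compare-select step for a trellis state.

    Returns (winning_metric, decision_bit). Assumes the smaller metric
    wins and that ties go to candidate 0.
    """
    candidate_0 = path_metric_0 + branch_metric_0  # add
    candidate_1 = path_metric_1 + branch_metric_1  # add
    if candidate_1 < candidate_0:                  # compare
        return candidate_1, 1                      # select path 1
    return candidate_0, 0                          # select path 0


if __name__ == "__main__":
    print(acs(10, 3, 8, 4))
```

The decision bit is what would be stored for backtracking, while the winning metric is what would be passed on to the next trellis stage.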
The block 2008 uses four eight-bit ALUs, in one example, four absolute differences of which can be calculated every cycle. A reduction tree is built into the block 2004 to accumulate the absolute differences into a 16-bit accumulator. The multi-cast network can be used to send these values across for further reduction. A total of 128 8-bit (64 16-bit) blocks 2008 are possible per clock cycle. However, it is believed that the effective utilization considering all of the overheads might result in a lower number.
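The sum of absolute differences (SAD) accumulated into a 16-bit accumulator can be sketched as below. The function name `sad` and the wrap-around behavior on overflow are assumptions for illustration; the text does not state how overflow is handled.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks.

    The running sum is kept in a 16-bit accumulator; wrap-around on
    overflow is an assumption of this sketch.
    """
    acc = 0
    for a, b in zip(block_a, block_b):
        acc = (acc + abs(a - b)) & 0xFFFF
    return acc


if __name__ == "__main__":
    print(sad([10, 20, 30, 40], [12, 18, 33, 40]))
```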
The ALUs implement the same conditional add-subtract function that the special ALU block implements, as discussed hereinabove. The control bits needed for despreading must be loaded into the local memory, from where they are fetched and stored in a register. The results are accumulated into a 16-bit accumulator, from where they can be transferred to other blocks 2004 for a reduction operation thereon. With despreading, in one example, it is possible to perform 128 simultaneous conditional add-subtracts in a single cycle. The energy per transition in this unit is higher than that used for the special ALU serving some general functions other than despreading and SAD. For a smaller number of fingers or for lower rate motion estimation, the special ALU is a more power efficient option.
Scaling of the processor 20 results in clusters of four sub-processors with separate buses for each cluster; otherwise, four sub-processors can share a single memory. Scalability with respect to processors has generally been by way of increasing the number of processors or increasing the frequency or speed of the processor. However, complex applications require scaling beyond that which has been previously done. In the present invention, the W type and N type sub-processors are modified so that four such sub-processors forming a cluster can process a single application.
Accordingly, the processor 22 is equipped with the capability to run control and sequential DSP code found in targeted applications more efficiently than RISC and Super Scalar processors directly based on compilation from C code. At the same time, it is designed to take advantage of automatic code generation techniques used in RISC and Super Scalar processors for legacy and light applications. Furthermore, the processor 22 works with mature, industry-standard software tools like Simulink for application mapping and development. Moore's Law can be utilized to enhance performance of the processor 22. The processor 22 is not only a highly parallel machine but also a heterogeneous multi-processor. It is a proven fact in both industry and academia that parallel heterogeneous multi-processors are required to address demanding multimedia and communications applications. It allows utilization of many of the automatic code generation techniques used in VLIW without using any power and area inefficient techniques. It is optimized to take advantage of repeating patterns based on compilation of control code from C. This significantly reduces control power and makes it possible to run compiled serial code efficiently. Additionally, the programming model of the processor 22 is designed to suit a large community of DSP programmers using tools familiar to them, like Simulink. Its development flow provides the means for efficient C-compilation of the control and sequential DSP code. Also, an extensive library of highly efficient communications and multimedia kernels is provided. Examples are parameterized libraries of FFT, IDCT, RRC, Viterbi, VLC, 2D/3D Graphics, Turbo codec, and De-scrambler.
The data path design in the processor 22 successfully integrates varying interconnect structures connecting functional units of varying granularity to effectively address a focused yet highly lucrative application mix.
The scalability of the processor 22 is designed to fit all applications in a single block (time multiplexed) with nearest neighbor connections within a block based on a standard SoC bus. A considerable amount of inefficiency and all the system level non-determinism is reduced because multiple blocks can be used to process multiple applications without any proprietary communication between them.
In
In
In
In one embodiment of the present invention, the output 4030 is 56 bits and the connections 4003 and 4001 are each 8 bits. While some examples, including the foregoing, are presented as to the number of bits, it is contemplated that any number of bits may be employed.
In operation, the GF MAC 4039 performs operations on elements from the field GF(2^m), where 1<=m<=8 and m is an integer value. The size of the field and the field's generating polynomial are specified using a matrix of coefficients, stored in the register file 4031, and programmably and therefore flexibly provided by a user. This allows any generator polynomial to be used. Furthermore, in another embodiment, multiple GF MACs, such as 16, are employed neighboring each other, wherein the output of an input neighboring GF MAC is provided at the connection 4017 and the output of the GF MAC 4039, shown at the connection 4023, is provided to an output neighboring GF MAC.
Values in the matrix of coefficients stored in the register file 4031 are shared between all sixteen GF MACs. Each of the GF MACs 4039 programmably performs conditional addition, multiplication, or multiply-accumulate operations. Input operands are provided by two eight-bit input registers, 4006 and 4011. The registers 4006 and 4011 may be loaded from the load path 4000; the register 4011 may also take result values. Result values are generated onto the connection 4027 and the connection 4023 from within the GF MAC 4039. The result of addition or multiplication by the GF multiply 4014 is stored in an eight-bit result register, such as the register 4028. The result may be accumulated with the value stored in the eight-bit ACC 4021 or from the ACC 4021 in a neighboring GF MAC.
The GF MAC 4039 retrieves stored coefficients from the register file 4031 and stored input values from the registers 4006 and 4011 to perform a multiplication operation in the Galois Field. As used herein, input selections or muxes are used to choose or select between different inputs and registers are used to store values. The GF add 4038 performs a Galois Field addition operation. This unit is dedicated for finite field arithmetic that is not found in programmable DSPs. These operations are either implemented in hardware in an Application Specific Integrated Circuit (ASIC) that is designed to do a fixed function or implemented in software on functional units that perform typical integer arithmetic (which is different from finite field arithmetic). Hardware implementation of finite field arithmetic in a DSP with programmable parameters is a very efficient tradeoff (versus implementing it all in software or all in hardware).
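For readers unfamiliar with finite field arithmetic, the following Python sketch shows multiplication and multiply-accumulate in GF(2^m) with a programmable generator polynomial. It uses the common shift-and-reduce formulation with the polynomial given as an integer (including its x^m term); this representation and the names `gf_mul`, `gf_add` and `gf_mac` are assumptions for illustration and differ from the matrix-of-coefficients form described above.

```python
def gf_add(a, b):
    """Addition in GF(2^m) is a bitwise XOR."""
    return a ^ b


def gf_mul(a, b, m=8, poly=0x11B):
    """Multiply two GF(2^m) elements modulo a generator polynomial.

    poly includes the x^m term; the default 0x11B is
    x^8 + x^4 + x^3 + x + 1, chosen here only as a familiar example.
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a      # add the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & (1 << m):
            a ^= poly        # reduce modulo the generator polynomial
    return result


def gf_mac(acc, a, b, m=8, poly=0x11B):
    """Multiply-accumulate: acc + a*b in GF(2^m)."""
    return gf_add(acc, gf_mul(a, b, m, poly))


if __name__ == "__main__":
    print(hex(gf_mul(0x57, 0x83)))
```

Because both `m` and `poly` are parameters, the same routine serves any field size from GF(2) to GF(2^8), which mirrors the programmability argued for in the preceding paragraph.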
In general use, variable data (e.g., message symbols and syndrome elements) is loaded into the register 4006 and constant data is stored in the register 4011. As noted above, previously computed results may be fed back through the register 4011 to facilitate computation. Each GF MAC 4039 passes a copy of the value stored in its register 4021 to its nearest neighboring GF MAC to facilitate implementation of polynomial division. The GF MAC 4039 has a latency of one cycle and one turnaround cycle for all operations. An example of the latency and turnaround associated with various operations of the GF MAC 4039 is shown in Table 1.
The load path 4000 is shown coupled to the input select 4074 through the connection 4040 and to the input select 4042 through the connection 4041. The input select is further shown to receive, as input the output of the register 4067 and another of the outputs of the register 4068, the output 4049. The input select 4042 is shown to receive as input, the outputs 4065 and 4049. The input select 4074 is shown to generate an output to the register 4072, which is shown to provide input to the register 4045 and to the SALU function 4047. The input select 4042 is shown to generate an output 4043 to the register 4044, which is shown to generate an output to the SALU function 4047. The register 4045 generates an output 4076 to the SALU function 4047. The output of the SALU function 4070 is provided to the register 4067. The register 4067 is further shown to provide an output 4065 to the source select 4053 and to the output select 4061. The source select 4053 is shown to receive as input, outputs 4048, generated by the neighbor ACC 4050 and 4050, generated by the reduction ACC 4051.
The output 4052 of the source select 4053 is shown coupled to the shifter 4055. The shifter 4055 is shown to generate an output 4056 to the summer 4063 and the output 4064 of the summer 4063 is provided as input to the ACC 4057. The output 4058 of the ACC 4057 is shown provided to the shifter 4059, which generates the output 4060 serving as input to the output select 4061. The output 4062 of the output select 4061 is shown coupled to the store path 4036.
In an exemplary embodiment of the present invention, the ALU 4079 is replicated a multiple number of times with neighboring ALUs feeding into one another in a chain. For example, the neighbor ACC 4050 stores a value from a neighboring special ALU, similar to the ALU 4079. In one example, 16 instances of special ALUs are employed.
In another embodiment, each ALU 4079 has two 8-bit input registers, i.e. the registers 4072 and 4044 are each eight bits and an internal 16-bit result register 4067, and a 16-bit accumulator (ACC) 4057. Result values may be accumulated un-shifted or after an optional shift has been applied. The ACC 4057 may be shifted prior to storage in the store path 4036.
The ALU 4079 maintains three condition flags: a zero flag, Z; a negative flag, N; and a carry flag, C. N, C, and Z may be used to predicate execution. The carry flag may be included in an add or subtract operation to support multi-precision arithmetic. By default, the ALU 4079 performs integer arithmetic. Support for fractional fixed-point arithmetic is provided through two shift units. The first shift unit performs either a left shift of one bit, no shift, or a right shift of 1, 2, or 3 bits of the R register 4067 prior to accumulation. The second shift unit may be used to optionally shift the accumulator either one bit to the left or up to six bits to the right prior to output onto the load path 4000.
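The role of the condition flags, and of the carry flag in multi-precision arithmetic, can be sketched in Python as follows. The function names `add16` and `add32_via_16` and the dictionary used to hold the flags are assumptions for illustration only.

```python
def add16(a, b, carry_in=0):
    """16-bit add returning (result, flags) with Z, N and C flags."""
    total = (a & 0xFFFF) + (b & 0xFFFF) + carry_in
    result = total & 0xFFFF
    flags = {
        "Z": result == 0,             # zero flag
        "N": bool(result & 0x8000),   # negative flag (sign bit set)
        "C": total > 0xFFFF,          # carry out of bit 15
    }
    return result, flags


def add32_via_16(x, y):
    """32-bit add built from two 16-bit adds chained through the carry flag."""
    low, low_flags = add16(x & 0xFFFF, y & 0xFFFF)
    high, high_flags = add16(x >> 16, y >> 16, int(low_flags["C"]))
    return (high << 16) | low, high_flags


if __name__ == "__main__":
    print(add16(0xFFFF, 1))
    print(hex(add32_via_16(0x0001FFFF, 0x00000001)[0]))
```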
The ALU 4079 does not perform saturation. Internal 16-bit conditional execution registers (CR), the registers 4069 and 4045, are used for conditional add/subtract operations.
An accumulation of the accumulators from each ALU 4079 may be performed using the summation operation. Special ALU arithmetic and logical operations are summarized in Table 2.
In
The register 4081 is shown coupled to the input select 4085 through an input 4082, generated by the register 4081. The input select 4085 is shown to further receive as input, a bitstream input 4082. The input select 4085 is shown to generate an output 4084 that is provided as input to the generator 4100. The register 4086 is shown to receive an input 4087, which is shown coupled to the load path 4000. The register 4089 is shown to receive an input 4088, which is shown coupled to the load path 4000. The registers 4086 and 4089 generate outputs 4091 and 4090, respectively, which are provided as input to the generator 4100.
The generator 4100 is shown to provide an output 4092, provided as input to the register 4094. Further, as output, the generator 4100 is shown to generate a bitstream output 4093, which also serves as another input to the register 4094. The register 4094 provides an output 4095 to the store path 4036.
In an exemplary embodiment, a number N of combiners 4104 is used, with N being an integer value and 1<=N<=128. Each combiner can be used to scramble data using a pseudo-random bit sequence (PRBS) or to compute a CRC checksum. In an exemplary embodiment, the A register 4081 is 32 bits and holds input data. The S register 4089 is a 32-bit shift register and holds the PRBS state or the current value of the CRC checksum. A 32-bit control register defines the feedback connections applied to the S register 4089. A 32-bit output register 4094 either accumulates the results of the scrambling operation or is used to hold the current CRC checksum.
In data scramble mode, input data is written to the A register 4081 and the PRBS seed value is written to the S register 4089. A control word is loaded into the data scrambler control register. For each 1'b1 in the control register, the corresponding bits from the S register 4089 are XOR-ed together. The result is shifted into the most significant bit of the S register 4089 and further XOR-ed with the next data bit from the A register 4081. The result of the XOR is shifted into the result register 4094.
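The data scramble mode can be modeled bit by bit in Python. This is a sketch under stated assumptions: the function names `parity` and `scramble`, the list-of-bits data format, and the right-shift direction of the state register are chosen for illustration and are not confirmed by the text.

```python
def parity(value):
    """XOR of all bits of value (1 if an odd number of bits are set)."""
    return bin(value).count("1") & 1


def scramble(data_bits, seed, control, width=32):
    """Scramble a list of bits with a PRBS defined by seed and control.

    control: 1 bits select the state bits that are XOR-ed into the feedback
    seed:    initial contents of the state (S) register
    """
    state = seed & ((1 << width) - 1)
    out = []
    for bit in data_bits:
        feedback = parity(state & control)                  # XOR of tapped bits
        state = (state >> 1) | (feedback << (width - 1))    # shift into the MSB
        out.append(bit ^ feedback)                          # XOR with data bit
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    scrambled = scramble(data, seed=0x1, control=0x3)
    print(scrambled, scramble(scrambled, seed=0x1, control=0x3))
```

Because the PRBS in this sketch does not depend on the data, applying the same seed and control word a second time restores the original bits, so one routine serves for both scrambling and descrambling.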
In the CRC mode, the S register 4089 is initially cleared. The CRC generation polynomial is written to the control register. For each 1'b1 in the control register, the corresponding bits from the S register 4089 are XOR-ed together. The result is XOR-ed with the next bit shifted out of the A register 4081. The result is shifted into the most significant bit of the S register 4089. After all data has been processed through the CRC generator, the result can be written to the result register 4094 and from there to the store path 4036.
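The CRC mode steps can be followed literally in a short Python sketch. The function name `crc_sketch`, the bit ordering, and the right-shift direction are assumptions for illustration; the sketch is not claimed to reproduce the bit conventions of any particular standardized CRC.

```python
def crc_sketch(data_bits, control, width=32):
    """Follow the CRC mode steps: feedback from tapped state bits,
    XOR with the next data bit, shift the result into the MSB.

    control holds the generation polynomial taps (assumed encoding).
    """
    state = 0  # the S register is initially cleared
    for bit in data_bits:
        feedback = bin(state & control).count("1") & 1  # XOR of tapped bits
        new_bit = feedback ^ bit                        # XOR with data bit
        state = (state >> 1) | (new_bit << (width - 1)) # shift into the MSB
    return state


if __name__ == "__main__":
    print(hex(crc_sketch([1, 0, 1, 1, 0, 0, 1, 0], control=0x8D, width=8)))
```

Since the register starts cleared and every step is an XOR or a shift, the checksum is linear over GF(2): the checksum of the XOR of two equal-length messages equals the XOR of their checksums.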
In
In operation, the interleaver 3015 supports one- and two-dimensional permutations of bits. Bits may be written into the memory 3009 in unpermuted order, and read from the memory 3009 in permuted order, or written to the memory 3009 in permuted order, and read from the memory 3009 in unpermuted order.
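As one concrete case of a two-dimensional permutation, the following Python sketch writes bits row by row and reads them column by column, with the inverse operation restoring the original order. The row/column block scheme and the names `interleave` and `deinterleave` are assumptions for illustration; the actual permutations are defined by the control bits in each AGU.

```python
def interleave(bits, rows, cols):
    """Write bits row by row (unpermuted), read them column by column (permuted)."""
    if len(bits) != rows * cols:
        raise ValueError("bit count must equal rows * cols")
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(bits, rows, cols):
    """Write bits in permuted order, read them back in unpermuted order."""
    if len(bits) != rows * cols:
        raise ValueError("bit count must equal rows * cols")
    out = [0] * (rows * cols)
    index = 0
    for c in range(cols):
        for r in range(rows):
            out[r * cols + c] = bits[index]
            index += 1
    return out


if __name__ == "__main__":
    print(interleave(list(range(6)), rows=2, cols=3))
```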
The bit memory is accessed by specifying the address of a bit, byte (8 bits), halfword (16 bits), or word (32 bits) to read or write. The addresses are generated by AGU1 3001 or AGU2 3012 based on input values provided on the wires 3000 and 3004. The bit address can be considered to be permuted or unpermuted. The permutation is defined using control bits contained within each AGU. The interleaver 3015 of
The AGU provides program controlled hardware support for generating addresses such that reading and writing from/to the AGU results in interleaving of data. Typically, this would either be done in hardware without program control or in software without hardware assists. The hardware solution is not flexible to accommodate different interleaving schemes that may exist. The software solution is not power and die-size efficient. A program controlled hardware support is an efficient tradeoff for reducing the area and power while accommodating different modes of operation.
In one embodiment of the present invention, the input 3043 is eight bits and the input is 128 bits and the input 3040 is 4 bits. The input (3040) is coupled directly to the input register A (3041). Input register D (3046) is coupled to 3053 using the 128-bit wide wire 3044. Input register D is coupled to the LUT memory using wire 3045. The output of 3047 is coupled to the output register O (3049) using wire 3048. Output register O is coupled to the store path (3052) using wire 3050.
In operation, the memory 3047 supports rapid transformation of input data values to output data values using a user-specified look-up table (LUT) and a user-specified mapping. Data values are presented as addresses into the register 3041 using either the input 3040 or from the load path via the input 3043. The user specifies the mapping by writing specific values into the memory 3047 via the register 3046. Variable amounts of data can be read out from the memory 3047 and placed on the output 3048 to support different mapping algorithms. Once data is in the register 3049, it may be written to the store path 3052. The embodiment of
In
In operation, the unit 3018 performs data encoding and puncturing operations used in forward error correction algorithms employed in communications protocols. Encoding adds redundancy (additional bits) to input data bits to ensure reliable reception. The encoding scheme to use is specified by the user by programming the registers G 3031 and C 3037.
Puncturing removes specific bits from the encoded bit stream (the input 3023) to reduce the number of bits that must be transmitted or stored. The pattern of bits that is removed is user-specified by programming the register P 3034. The data bits to be encoded are provided by writing values to the register A 3021 or by reading the bitstream input 3023. The source of bits is selected by programmably controlling the mux 3024. The unit 3026 serves to encode a single input bit and optionally puncture the result to produce between 1 and 4 output bits onto the output 3027. The output bits can be passed to other functional units via the output 3027, which is also labeled bitstream output, or collected into a larger width word in the register 3028 for storage using the store path 3019.
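Encoding followed by puncturing can be illustrated with a small Python sketch of a convolutional encoder. The constraint length of 3, the generators 7 and 5 (octal), the cyclic keep-pattern, and the names `conv_encode` and `puncture` are assumptions chosen as a familiar textbook case; they are not stated to be the contents of the registers G, C or P.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Convolutionally encode bits; each input bit yields len(generators) output bits."""
    mask = (1 << constraint_length) - 1
    shift_reg = 0
    out = []
    for bit in bits:
        shift_reg = ((shift_reg << 1) | bit) & mask           # shift in the new bit
        for g in generators:
            out.append(bin(shift_reg & g).count("1") & 1)     # parity of tapped bits
    return out


def puncture(bits, pattern):
    """Keep a bit where the cyclically repeated pattern holds 1, drop it where 0."""
    return [b for i, b in enumerate(bits) if pattern[i % len(pattern)]]


if __name__ == "__main__":
    encoded = conv_encode([1, 0, 1, 1])
    print(encoded, puncture(encoded, [1, 1, 0, 1]))
```

In this example the unpunctured code has rate 1/2 (two output bits per input bit), and the pattern `[1, 1, 0, 1]` drops one bit in four, raising the rate to 2/3 at the cost of some error-correcting strength.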
The embodiment of
Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. It is therefore intended that the following claims be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.
This application claims the benefit of U.S. Provisional Patent Application No. 60/791,765, entitled “CoolN Documentation and Usage Notes”, filed on Apr. 12, 2006 and is a continuation-in-part application of U.S. patent application Ser. No. 11/180,068, entitled “Programmable Processor Architecture” and filed on Jul. 12, 2005, the disclosures of both of which are incorporated herein by reference as though set forth in full.
Number | Date | Country
---|---|---
60791765 | Apr 2006 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 11180068 | Jul 2005 | US
Child | 11733707 | Apr 2007 | US