This invention relates to processors for executing stored programs, and in particular to a vector processor employing special purpose registers to reduce instruction width and employing multi-pipe vector block matching.
Vector processors are processors which provide high level operations on vectors, that is, linear arrays of numbers. A typical vector operation might add two 64-entry, floating point vectors to obtain a single 64-entry vector. In effect, one vector instruction is equivalent to a loop with each iteration computing one of the 64 elements of the result, updating all the indices and branching back to the beginning. Vector operations are particularly useful for image processing or scientific and engineering applications where large amounts of data must be processed in generally a repetitive manner. In a vector processor, the computation of each result is independent of the computation of previous results, thereby allowing a deep pipeline without generating data dependencies or conflicts. In essence, the absence of data dependencies is determined by the particular application to which the vector processor is applied, or by the compiler when a particular vector operation is specified.
A typical vector processor includes a pipeline scalar unit together with a vector unit. In vector-register processors, the vector operations, except loads and stores, use the vector registers. Typical prior art vector processors include machines provided by Cray Research and various supercomputers from Japanese manufacturers such as Hitachi, NEC, and Fujitsu. Processors such as provided by these companies, however, are usually physically quite large, requiring cabinets filled with circuit boards. Such machines therefore are expensive, consume large amounts of power, and are generally not suited for applications where cost is a significant factor in the selection of a particular processor.
One technology where reduction in cost of processors greatly expands markets is image processing. There are now many well known image encoding and decoding technologies used to provide full-speed full-motion video with sound in real time over limited bandwidth links. Such applications are particularly suitable for lower cost video processors. Reduction in the cost of such processors, however, requires substantial reductions in their complexity, and implementation of such processors on integrated circuits typically precludes the use of 64-bit instruction words. The reduction in instruction width, however, so diminishes the capability of the processor as to render it less than desirable for such image processing, scientific or engineering applications.
This invention provides a vector processor with limited instruction width, but which provides features of a processor having a greater instruction width by virtue of a special purpose register, and the referencing of that register by various instructions. This enables a limited width instruction to address the vector memory and provide the functionality of a larger processor, but without requiring the space, multiple integrated circuits, and higher power consumption of a larger processor. In addition, the simplicity of the design enables implementation on a single integrated circuit, thereby shortening signal propagation delays and increasing clock speed. The special purpose registers are set up by a scalar processor, and then their contents are reused without the necessity of reissuing new instructions from the scalar processor on each clock cycle. All vector instructions include a special field which indexes into these special registers to retrieve the attributes needed for executing the vector instructions.
In a preferred embodiment the vector processor includes a set of vector registers for storing data to be used in the execution of instructions and a vector functional unit which is coupled to the vector registers for executing instructions. The functional unit executes the instructions in response to operation codes provided to it, and those operation codes include a field which references a special register. When each instruction is executed reference is made to both the operation code and the special register, and the contents of both the operation code and the special register are used for the execution of the instruction. In one implementation, each vector instruction includes a length and a starting point, and a special register is used to store the information about the length and starting point for each vector instruction.
The invention also provides a memory organization for efficient use of the processor. In particular, a memory architecture is provided in which pipelined accesses are made to groups of banks of SRAM memories. A retry capability is provided to allow multiple accesses to the same bank. Data is moved into and out of the banks of SRAM using a parallel loading technique from a shift register.
Preferably the memory system includes a group of access ports for enabling access to the memory, a set of address lines and a set of data lines coupled to the access ports to receive address information and data from the access ports, and a pipelined series of address decoder stages coupled to the address lines. As addresses arrive, they are transferred from decoder to decoder, and each decoder compares the address on the address lines with a set of addresses assigned to that decoder corresponding to the memory banks associated with it. A first set of memory banks is coupled to the address lines and the data lines between a first address decoder and a second address decoder in the series of address decoders, and a second set of memory banks is coupled to the address lines and the data lines after the second address decoder in the series of address decoders. A shift register connected to each of the sets of memory banks enables block loads and stores to the memory banks.
An additional aspect of the invention is the provision of instructions for invoking the special register described above. This register stores information about the length and starting point for each vector instruction. In one embodiment a computer implemented method for executing a vector instruction which includes an operation code and references to various registers, includes the steps of decoding the vector instruction to obtain information about the operation code defining the particular mathematical, logical, or other type operation to be performed on a vector. At the same time the vector instruction is decoded to obtain an address of a first vector register where the at least one vector upon which the operation to be performed is stored, the address of a second vector register where the result of the operation is to be stored, and the address of a third register which stores the starting element and the vector length. The vector instruction is then executed using information from the first and third registers.
b is a diagram illustrating the G register of
This invention provides a vector processor which may be implemented on a single integrated circuit. In a preferred embodiment, five vector processors together with the data input/output unit and a DRAM controller are implemented on a single integrated circuit chip. This chip provides a video encoder which is capable of generating bit streams which are compliant with MPEG-2, Windows Media 9, and H.264 standards.
The scalar processor will typically be a single issue design with hardware interlocks. Instructions issue in order and complete in order with instruction decode requiring one clock. All operations performed by the scalar processor are 32 bits, but support 32, 16, and 8-bit data values. All execution units complete in one clock except the multiplier which requires four clocks, data cache loads which require three clocks, and the 32-bit shift which requires two clocks.
The two banks of 32 entry scalar register files 70 provide one file for the supervisor, and another file for applications. As shown in
The scalar processor 10 has four condition code registers (c0, c1, c2, c3), each with a single flag bit. These 1-bit flags reflect the overflow (O) and carry (C) conditions. The meaning of the condition code flag depends on the type of instruction that set the flag:
signed arithmetic instruction when overflow,(MSB xor MSB+1)−>flag; (1)
unsigned arithmetic instruction when a carry=(MSB+1)−>flag; (2)
saturated arithmetic instruction, signed or unsigned, when overflow−>flag; and (3)
compare instruction(EQ, LE, . . . )−>flag. (4)
Instructions that set a condition code must specify which one of the four registers is to be used. Some instructions do not affect the condition codes. If the programmer needs a “sticky flag” (for example, to see if any result in a loop overflowed), an add with carry instruction can be used with an immediate value of 1 as an input.
ADDC R1,(R1),C1;
So if R1 is cleared before the loop and contains a 0 at the end of the loop, the conditional flag was never set and overflow never occurred in the loop.
An instruction that specifies a condition code register to be set as a result of the operation performed also modifies the CC flag. For example, an instruction that compares two registers for equality and chooses c2 as the condition code register destination will set the flag. In contrast, a logical instruction such as the logical-and instruction cannot specify a condition code register and so leaves all condition code flags unmodified.
A branch on condition instruction will not modify the cC flag. In some instructions a cC register is used as a carry in and if there is an overflow from the operation, then the same cC register is modified.
An overflow is generated when the result of an arithmetic operation falls outside the range of representable numbers, thus producing an incorrect result. In 2s complement arithmetic, overflow is detected when the MSB and MSB+1 bits of the result differ. Both operands must be sign-extended to MSB+1. A Carry is generated when a “1” is generated in the MSB+1 position.
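The overflow and carry rules above can be illustrated with a minimal Python sketch. The function names and the default 16-bit width are assumptions made for illustration only; the text states that the scalar processor supports 32, 16, and 8-bit data values.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def signed_add_flags(a, b, bits=16):
    """Add two signed values; return (result, overflow_flag).

    Both operands are sign-extended by one position (to MSB+1) before the
    add. Overflow is flagged when bit MSB and bit MSB+1 of the sum differ.
    """
    wide = (sign_extend(a, bits) + sign_extend(b, bits)) & ((1 << (bits + 1)) - 1)
    msb = (wide >> (bits - 1)) & 1
    msb_plus_1 = (wide >> bits) & 1
    return wide & ((1 << bits) - 1), msb ^ msb_plus_1


def unsigned_add_flags(a, b, bits=16):
    """Add two unsigned values; return (result, carry_flag from MSB+1)."""
    mask = (1 << bits) - 1
    wide = (a & mask) + (b & mask)
    return wide & mask, (wide >> bits) & 1
```

In this sketch, adding 1 to the largest positive 16-bit value (0x7FFF) raises the overflow flag, while adding 1 to 0xFFFF as an unsigned value raises the carry flag.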
The Vector Mask registers (mM) 110 are used to store condition codes for the vector functional units. Each vector pipe has eight M registers that store a single bit for each element in the vector register. If the vector length is set to 32, then the M register is 32 bits. The meaning of the condition code flag depends on the type of instruction that set the flag:
Signed arithmetic instruction when overflow, (MSB xor MSB+1)−>flag
Unsigned arithmetic instruction when a carry=(MSB+1)−>flag
Saturated arithmetic instruction, signed or unsigned, when overflow−>flag
Compare instruction(EQ, LE, . . . )−>flag
At the end of a vector instruction, the M register can be moved to a scalar register and a bit reduction operation performed to check if any flags were set during the vector operation. The Mask registers can also be used to hold carry values for instructions that have a carry in. For example, double precision (32-bit) arithmetic can be performed as:
vaddu nVD,nVA,nVB,mM add low bits unsigned, carry to mM
vaddc nVD,nVA,nVB,mM add high bits with carry from mM
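The vaddu/vaddc pairing can be modeled in a few lines of Python. This is a sketch only: it assumes 16-bit vector elements (so that 32-bit values are split into a low and a high half), and the helper add32 is a hypothetical wrapper that is not part of the instruction set.

```python
def vaddu(va, vb, bits=16):
    """Unsigned element-wise add; returns (sums, carries). Carries model mM."""
    mask = (1 << bits) - 1
    sums = [(a & mask) + (b & mask) for a, b in zip(va, vb)]
    return [s & mask for s in sums], [s >> bits for s in sums]


def vaddc(va, vb, m, bits=16):
    """Element-wise add with a per-element carry-in taken from mask m."""
    mask = (1 << bits) - 1
    sums = [(a & mask) + (b & mask) + c for a, b, c in zip(va, vb, m)]
    return [s & mask for s in sums], [s >> bits for s in sums]


def add32(x, y):
    """Add vectors of 32-bit values using two 16-bit vector adds."""
    lo, carry = vaddu([v & 0xFFFF for v in x], [v & 0xFFFF for v in y])
    hi, _ = vaddc([v >> 16 for v in x], [v >> 16 for v in y], carry)
    return [(h << 16) | l for h, l in zip(hi, lo)]
```

The carry list returned by vaddu plays the role of the mask register: each element's carry out of the low half feeds the matching element of the high-half add.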
Vector Mask registers can also be used with shift instructions on the vector side. For example, if a shift instruction shifts out any value of 1, the vector mask is set. This can be used to find the largest number in a vector and then scale the vector accordingly. The M register is used in the vector merge instruction. In this case, the mask bit selects whether the element from source one or the element from source two is written to the destination register.
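As a rough illustration of the merge behavior, the following Python sketch selects between two source vectors element by element. The text does not say which mask value selects which source, so the choice here (a mask bit of 1 selects source one) is an assumption, as is the function name vmerge.

```python
def vmerge(source_one, source_two, mask):
    """Per-element select: mask bit 1 picks source one, 0 picks source two.

    The polarity of the mask bit is an assumption for illustration.
    """
    return [a if bit else b for a, b, bit in zip(source_one, source_two, mask)]
```

For example, merging [1, 2, 3] and [9, 8, 7] under the mask [1, 0, 1] keeps the first and third elements of source one and the second element of source two.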
The data is transferred from the DRAM backing store through the high-speed system bus 140 to the SRAM. Data from the SRAM is transferred by the memory controller to the register files by the scalar processor 10, and is interlocked with the appropriate instructions in the hardware. The memory interface has a capacity of twelve 16-bit simultaneous transfers per clock.
The vector function units 210 are capable of running two operations at the same time in each vector unit. Four vector functional units can have eight operations occurring simultaneously. Each vector function unit is capable of four reads and two writes simultaneously. To keep the functional units busy, the SRAM 30 buffers feed the vector registers 200 using memory controllers. These memory controllers are programmed by the scalar processor 10, but are located in each of the functional units 210. There are three memory controllers in each functional unit, two loads and one store.
The vector processor 210 supports chaining. For example, if the first instruction issued is a multiply that stores the result in a vector register, a second instruction can issue on the next clock that reads the result in the register file from the first operation, and performs a different operation on the result of the first multiply. The hardware automatically schedules the second instruction when the result of the first operation is complete by register scoreboarding of the vector register elements.
The special “G” register file 235 is organized as eight 48-bit registers. This register file is capable of one read and one write, and can be read and written by various instructions, as well as read by the SRAM load store controller 236. As will be described below in more detail, vector load and store operations use the “G” register file to obtain the desired values for a series of parameters. In the preferred embodiment these parameters include (1) vector length, (2) starting element, (3) repeat, (4) skip, and (5) stride. The bit positions where these values are stored are:
gG[47:42]<−(6-b Vector Length)
gG[41:37]<−(5-b Starting Element)
gG[36:31]<−(6-b Repeat)
gG[30:15]<−(16-b Skip)
gG[14:0]<−(15-b Stride)
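The field layout above can be expressed as a small pack/unpack helper. This Python sketch uses the bit positions given in the list; the names G_FIELDS, pack_g, and unpack_g are hypothetical and chosen only for illustration.

```python
# (field name, high bit, low bit) for the 48-bit G register layout
G_FIELDS = (
    ("vector_length", 47, 42),
    ("starting_element", 41, 37),
    ("repeat", 36, 31),
    ("skip", 30, 15),
    ("stride", 14, 0),
)


def pack_g(**values):
    """Build a 48-bit G register word; omitted fields default to 0."""
    word = 0
    for name, hi, lo in G_FIELDS:
        width = hi - lo + 1
        word |= (values.get(name, 0) & ((1 << width) - 1)) << lo
    return word


def unpack_g(word):
    """Split a 48-bit G register word back into its five fields."""
    return {
        name: (word >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, hi, lo in G_FIELDS
    }
```

The five field widths (6 + 5 + 6 + 16 + 15) sum to 48 bits, so the fields tile the register with no gaps.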
The G register is illustrated in more detail in
Whenever an operation is carried out using a vector opcode, that instruction includes an index into the G register to specify the desired parameters for that operation. In the preferred embodiment, to select one of the eight 48-bit registers, the G field in the vector instruction will be three bits in length.
The vector pipe shown in
Each vector pipe also has a special purpose 40-bit register file called aACC. This register file holds the 40-bit result of each MAC instruction, and each of the two add/sub reduction 24-bit Accumulators. The Accumulator is loaded from the ACC register file at the beginning of each MAC or reduction operation. At the end of the operation the final result in the Accumulator is stored in the ACC register. This register file is dual-ported to allow two operations to occur at the same time.
As shown in
A high speed interface is provided to all banks of the SRAM. The interface accumulates 256 bytes in a buffer, and then transfers all 256 bytes in four clocks to all of the banks. This 256-byte buffer is read or written from the SRAM on 256-byte boundaries. If any vectors are in flight, they are held for one clock while the read or write occurs. The Memory Controller routes each of the potential twelve reads or writes from the vector registers to the proper banks. Since each vector register may have up to 32 elements, a stride of one assures 32 consecutive banks will be addressed. Since a bank can read or write on every clock, there is no bank conflict between addresses in the same vector. There may, however, be bank conflicts due to addresses from other vectors that are executing. A single conflict will cause one of the addresses to be delayed by four clocks. The priority is hardwired by vector unit, with vector unit 0 having the highest priority and vector unit 3 the lowest priority. Within each vector unit, load 0 has higher priority than load 1, and the store operation has the lowest priority.
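The hardwired priority scheme can be sketched as a simple arbiter. This Python model is an illustration under stated assumptions: requests are represented as hypothetical (vector_unit, port, bank) tuples, only one clock is modeled, and the four-clock delay of a losing request is reported rather than simulated.

```python
# Lower rank wins within a vector unit: load 0, then load 1, then store.
PORT_RANK = {"load0": 0, "load1": 1, "store": 2}


def arbitrate(requests):
    """Resolve bank conflicts for one clock.

    requests: list of (vector_unit, port, bank) tuples.
    Returns (granted, delayed). Vector unit 0 has the highest priority;
    a delayed request would be retried later (not modeled here).
    """
    granted, delayed, busy_banks = [], [], set()
    for req in sorted(requests, key=lambda r: (r[0], PORT_RANK[r[1]])):
        if req[2] in busy_banks:
            delayed.append(req)
        else:
            busy_banks.add(req[2])
            granted.append(req)
    return granted, delayed
```

In this sketch a store from vector unit 0 wins over a load from vector unit 3 when both target the same bank, because the vector unit number is compared first.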
The format of the vadd instruction is:
vadd vVD,vVA,vVB,mM,P,gG
A typical implementation is:
The fields in
Furthermore, in the figures associated with many of the following instructions, reference is made to fields 0x0, 0x1 etc. This nomenclature is intended to indicate that the bits so marked designate hexadecimal 0, hexadecimal 1, etc. In addition, “P” refers to the vector processor pipe number and “G” to the G register.
mvadd vVD,vVA,vVB,mM,gG
This instruction is used on all four pipes at the same time. The arithmetic functional unit is selected by the hardware. Each element of the vector register specified by the VA field 280 is added to the vector element of vector register vVB 281. The result element is placed into the vVD vector register 282. The 3-bit M field 283 selects the vector pipe M register that contains the vector mask registers. If the sum has an overflow, a 1 is placed in the M register. The G field 284 selects the appropriate G register containing the starting element and vector length.
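The element-wise behavior described here can be sketched for a single pipe in Python. Several details are assumptions made for illustration: elements are treated as signed 16-bit values, an overflowing sum wraps, elements outside the range selected by the G register are left unchanged, and the parameter names start and length stand for the starting element and vector length read from the G register.

```python
def vadd(vd, va, vb, start, length, bits=16):
    """Signed element-wise add over elements [start, start + length).

    Returns (new_vd, mask). A mask bit of 1 marks an element whose sum
    overflowed the signed range. Wrapping on overflow is an assumption.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    out, mask = list(vd), [0] * len(vd)
    for i in range(start, min(start + length, len(vd))):
        total = va[i] + vb[i]
        mask[i] = int(total < lo or total > hi)
        out[i] = (total - lo) % (1 << bits) + lo  # wrap into the signed range
    return out, mask
```

A multi-pipe form such as mvadd could be modeled by applying the same function once per pipe with that pipe's registers.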
A typical implementation is:
As shown above, the G register is set up by the scalar processor and then used over and over without the necessity of issuing new vector instructions. The G register provides the special attributes needed for execution of the instructions, such as vadd and mvadd. In the case of these instructions the G register provides the vector length and the starting field, thereby providing an indication of how many computations are required and where the addressing starts.
The repeat, skip and stride relate to how an address sequence is generated for vector load and store instructions. The starting address of the first element is computed in the scalar pipe. A stride value is then added to this address and accumulated on every subsequent clock. In addition a skip value is also added to this address stream every nth cycle defined by the repeat field.
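One plausible reading of this address sequence is sketched below in Python. It assumes that the skip is added in addition to the stride on every repeat-th element, and that a repeat of 0 disables the skip; neither detail is confirmed by the text, and the function name is hypothetical.

```python
def address_sequence(base, length, stride, skip, repeat):
    """Generate `length` element addresses starting at `base`.

    The stride is accumulated for every element after the first. On every
    `repeat`-th element the skip is added as well (assumed interpretation).
    """
    addrs, addr = [], base
    for i in range(length):
        if i > 0:
            addr += stride
            if repeat and i % repeat == 0:
                addr += skip
        addrs.append(addr)
    return addrs
```

Under these assumptions, address_sequence(0, 8, 1, 12, 4) walks two rows of a 4-pixel-wide block stored in an image with a 16-element row pitch: four consecutive addresses, a jump to the next row, then four more.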
The overall impact of the G register is the enablement of a richer opcode set, but without need for long instruction words.
The scalar processor reloads the G register when vector operations occur. The vector operations typically require 32 clocks, thereby providing the scalar processor the opportunity to reload the G register. This capability is enhanced by the vector operation renumbering the contents of the G register when the vector operation begins execution. This enables the G register to be reloaded immediately. The stride feature of the G register is particularly beneficial for video applications in which blocks of pixels from a serial data stream are addressed and processed. The stride allows addressing of the SRAM to step from one location to another where those locations are not contiguous, but are evenly spaced.
The vector processor described above includes many instructions facilitating operations with the G register. These instructions are discussed next.
The “Move One Scalar to G Register (m1sg)” instruction is shown in
m1sg rA,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. Portions of the contents of general register rA are sent to the selected vector pipe and stored in the addressed gG register. General-purpose register A contains the 6-bit repeat and the 16-bit skip. A typical implementation is:
gG[47:42]<−gG[47:42] (vector length)
gG[41:37]<−gG[41:37] (starting element)
gG[36:31]<−rA[21:16] (repeat)
gG[30:15]<−rA[15:0] (skip)
gG[14:0]<−gG[14:0] (stride)
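The register transfers listed for m1sg amount to a read-modify-write of bits [36:15] of the G register. A minimal Python sketch, treating the G register and rA as plain integers (the function signature is an assumption for illustration):

```python
def m1sg(g, ra):
    """Update repeat and skip in a 48-bit G word from scalar register rA.

    rA[21:16] supplies the 6-bit repeat and rA[15:0] the 16-bit skip.
    Vector length, starting element, and stride are preserved.
    """
    repeat = (ra >> 16) & 0x3F
    skip = ra & 0xFFFF
    keep = g & ~(((1 << 22) - 1) << 15)  # clear bits [36:15]
    return (keep | (repeat << 31) | (skip << 15)) & ((1 << 48) - 1)
```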
The “Move Two Immediates to G Register (m2ig)” instruction is shown in
m2ig I,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. The immediate value for the vector length is in bits [16:11] (0x20). The starting element is in bits [25:21] (0x00) of the instruction, and is sent to the vector pipe and stored in the addressed gG register. A typical implementation is:
gG[47:42]<−I[16:11] (vector length)
gG[41:37]<−I[25:21] (starting element)
gG[36:31]<−gG[36:31]
gG[30:15]<−gG[30:15]
gG[14:0]<−gG[14:0]
The “Move Two Scalars to G Register (m2sg)” instruction is shown in
m2sg rA,rB,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. Portions of the contents of the two general registers rA and rB are sent to the selected vector pipe, and stored in the addressed gG register. General-purpose register A contains the 5-bit starting element, and general-purpose register B contains the 6-bit vector length. A typical implementation is:
gG[47:42]<−rB[5:0] (vector length)
gG[41:37]<−rA[4:0] (starting element)
gG[36:31]<−gG[36:31] (repeat)
gG[30:15]<−gG[30:15] (skip)
gG[14:0]<−gG[14:0] (stride)
The “Move Three Scalars to G Register (m3sg)” instruction is shown in
m3sg rS,rA,rB,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. Portions of the contents of the three general registers rA, rB, and rS are sent to the selected vector pipe and stored in the addressed gG register. General-purpose register S contains the 6-bit repeat, and general-purpose register A contains the 16-bit skip. General-purpose register B contains the 15-bit stride. A typical implementation is:
gG[47:42]<−gG[47:42] (vector length)
gG[41:37]<−gG[41:37] (starting element)
gG[36:31]<−rS[5:0] (repeat)
gG[30:15]<−rA[15:0] (skip)
gG[14:0]<−rB[14:0] (stride)
The “Move Higher G Register to Scalar (mhgs)” instruction is shown in
mhgs rD,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. The high-order 17 bits of the gG register are sent to the scalar general-purpose D register. A typical implementation is:
rD[16:0]<−gG[47:31]
rD[31:17]<−0
The “Move Immediate to G Register (mi(vlg,seg,rg,skg,sg))” instruction is shown in
mi(vlg,seg,rg,skg,sg)I,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. The Stride and Skip Immediate is a 12-bit signed value. (An assembly error will occur if more than twelve bits are specified.) The immediate values as shown in Table 1 are sent to the selected gG register. The MSB of Stride has the sign extended to form a 15-bit value. The MSB of Skip has the sign extended to form a 16-bit value.
A typical implementation is:
mivl gG[47:42]<−I[19:14]
mise gG[41:37]<−I[18:14]
mir gG[36:31]<−I[19:14]
misk gG[26:15]<−I[25:14]
gG[30:27]<−I[25]
mis gG[11:0]<−I[25:14]
gG[14:12]<−I[25]
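The sign extension of the 12-bit skip and stride immediates can be sketched as follows. The Python model treats the instruction word and the G register as integers; the helper names are hypothetical, and only the skip and stride forms are shown.

```python
def imm12(instr):
    """Extract the 12-bit immediate from instruction bits [25:14]."""
    return (instr >> 14) & 0xFFF


def sign_extend_field(value, from_bits, to_bits):
    """Replicate the MSB of a from_bits-wide value up to to_bits."""
    sign = (value >> (from_bits - 1)) & 1
    ext = ((1 << (to_bits - from_bits)) - 1) << from_bits if sign else 0
    return (value | ext) & ((1 << to_bits) - 1)


def misk(g, instr):
    """Write the sign-extended 16-bit skip into gG[30:15]."""
    skip = sign_extend_field(imm12(instr), 12, 16)
    return (g & ~(0xFFFF << 15)) | (skip << 15)


def mis(g, instr):
    """Write the sign-extended 15-bit stride into gG[14:0]."""
    stride = sign_extend_field(imm12(instr), 12, 15)
    return (g & ~0x7FFF) | stride
```

Replicating I[25] into the upper bits of each field matches the two-line transfers in the listing: four copies for skip (gG[30:27]) and three for stride (gG[14:12]).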
The “Multi-Pipe Move Immediate to G Register (mmi(vlg,seg,rg,skg,sg))” instruction is shown in
mmi(vlg,seg,rg,skg,sg)I,gG
For this instruction all vector pipes are selected. The immediate values shown in Table 2 are sent to all vector pipes and the selected gG register. The MSB of Stride has the sign extended to form a 15-bit value. The MSB of Skip has the sign extended to form a 16-bit value.
A typical implementation is:
Multi-Pipes gG<−table Immediate
The “Multi-Pipe Move Scalar Register to G Register (mms(vlg,seg,rg,skg,sg))” instruction is shown in
mms(vlg,seg,rg,skg,sg)rA,gG
For this instruction all vector pipes are selected. The contents of the general-purpose scalar register rA are sent to all vector pipes and the selected gG register. Table 3 describes which bits from general-purpose register rA go to the fields of register gG.
A typical implementation is:
Multi-Pipes gG<−table(rA)
The “Multi-Pipe Move Scalar to Higher G Register (mmshg)” instruction is shown in
mmshg rA,gG
For this instruction all vector pipes are selected. The contents of general-purpose register rA are sent to all of the vector pipes and stored in the upper seventeen bits [47:31] of the addressed gG register in each pipe. A typical implementation of the instruction is:
gG[47:31]<−rA[16:0]
The “Multi-Pipe Move Scalar to Lower G Register (mmslg)” instruction is shown in
mmslg rA,gG
For this instruction all vector pipes are selected. The contents of general-purpose register rA are sent to all of the vector pipes and stored in the lower 31 bits [30:0] of the addressed gG register in each pipe. A typical implementation of the instruction is:
gG[30:0]<−rA[30:0]
The “Move Scalar Register to G Register (ms(vlg,seg,rg,skg,sg))” instruction is shown in
ms(vlg,seg,rg,skg,sg)rA,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. The contents of the general-purpose scalar register rA are sent to the selected vector pipe and then to the selected gG register. Table 4 shows which bits from the general-purpose register rA go to the fields of register gG.
The “Move Scalar to Higher G Register (mshg)” instruction is shown in
mshg rA,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. The contents of general-purpose register rA are sent to the selected vector pipe and stored in the upper seventeen bits [47:31] of the addressed gG register. A typical implementation of the instruction is:
gG[47:31]<−rA[16:0]
The “Move Scalar to Lower G Register (mslg)” instruction is shown in
mslg rA,P,gG
For this instruction the vector pipe is selected by the 3-bit P field. The contents of general register rA are sent to the selected vector pipe and stored in the lower 31 bits [30:0] of the addressed gG register. A typical implementation of the instruction is:
gG[30:0]<−rA[30:0]
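The split of the G register into an upper 17-bit half and a lower 31-bit half can be modeled with three short Python functions, covering mshg, mslg, and the mhgs read-back described earlier. The G register and scalar registers are treated as plain integers, which is an assumption made for illustration.

```python
G_MASK = (1 << 48) - 1
LOW_MASK = (1 << 31) - 1            # gG[30:0]: skip and stride
HIGH_MASK = ((1 << 17) - 1) << 31   # gG[47:31]: length, start, repeat


def mshg(g, ra):
    """Write rA[16:0] into gG[47:31], preserving gG[30:0]."""
    return (g & LOW_MASK) | ((ra & 0x1FFFF) << 31)


def mslg(g, ra):
    """Write rA[30:0] into gG[30:0], preserving gG[47:31]."""
    return (g & HIGH_MASK) | (ra & LOW_MASK)


def mhgs(g):
    """Read gG[47:31] into rD[16:0]; rD[31:17] is zero."""
    return (g >> 31) & 0x1FFFF
```

With this split, one mshg and one mslg are enough to load all 48 bits of a G register from two 32-bit scalar registers.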
The “Vector Load Byte Indexed (vlbi)” instruction is shown in
vlbi vVD,rA,rB,P,gG
For this instruction the vector data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD. The index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address. The index (rB) is a signed value, and the base (rA) register is an unsigned value. The byte in memory addressed by the EA is loaded into the low-order eight bits of general-purpose vector register vVD. The high-order bits of general-purpose register vVD are replaced with bit seven of the loaded value. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. A typical implementation of the instruction is:
The “Vector Load Byte Offset (vlbo)” instruction is shown in
vlbo vVD,rA,O,P,gG
For this instruction the vector byte data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD and sign-extended. The 6-bit signed offset is sign-extended and shifted left five bit positions, and then added to the contents of general-purpose register rA to form the effective SRAM address. The 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The EA refers to the SRAM. A typical implementation of the instruction is:
The “Vector Load Doublet Indexed (vldi)” instruction is shown in
vldi vVD,rA,rB,P,gG
For this instruction the vector data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD. The index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address. The index (rB) is a signed value, and the base (rA) register is an unsigned value. The doublet in the memory as addressed by the EA is loaded into general-purpose vector register vVD. The 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. A typical implementation of the instruction is:
The “Vector Load Doublet Offset (vldo)” instruction is shown in
vldo vVD,rA,O,P,gG
For this instruction the vector data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD. The 6-bit signed offset is sign-extended and shifted left six bit positions, and then added to the contents of general-purpose register rA to form the effective SRAM address. The 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and the vector length that will be used for this operation. Each pipe has one G register file. The EA refers to the SRAM. A typical implementation of the instruction is:
The “Vector Store Byte Indexed (vstbi)” instruction is shown in
vstbi vVS,rA,rB,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The index (rB) is a signed value, and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
SRAM EA<−(rB[31:0]+rA[31:0]+gG)
SRAM EA [7:0]<−(vVS[7:0])
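The store listing above can be expanded into a rough per-element model. In this Python sketch the SRAM is a bytearray, and the G register fields are passed as separate hypothetical parameters (start, length, stride, skip, repeat). The way the G register contributes to the address follows the stride, skip, and repeat description given earlier and is an interpretation, not a confirmed detail.

```python
def vstbi(sram, vs, ra, rb, start, length, stride, skip=0, repeat=0):
    """Store the low byte of each selected element of vs into sram.

    The first address is rA + rB (unsigned base plus signed index). The
    stride is added for each following element, and the skip is added as
    well on every `repeat`-th element (assumed interpretation).
    """
    addr = ra + rb
    for n in range(length):
        if n > 0:
            addr += stride
            if repeat and n % repeat == 0:
                addr += skip
        sram[addr] = vs[start + n] & 0xFF  # SRAM EA[7:0] <- vVS[7:0]
    return sram
```

A masked variant could be sketched by skipping the write whenever the corresponding mask bit is 0.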
The “Vector Store Byte Masked Indexed (vstbmi)” instruction is shown in
vstbmi vVS,rA,rB,mM,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address. The value in each element vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The index (rB) is a signed value, and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
The “Vector Store Byte Masked Offset (vstbmo)” instruction is shown in
vstbmo vVS,rA,O,mM,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The contents of general-purpose register rA are added to the offset to form the effective SRAM address. The value in each element vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The offset (O) is a signed value, and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
The “Vector Store Byte Offset (vstbo)” instruction is shown in
vstbo vVS,rA,O,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The signed offset is sign-extended, shifted left six bit positions, and added to the contents of general-purpose register rA to form the effective SRAM address. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The offset (O) is a signed value and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
The “Vector Store Doublet Indexed (vstdi)” instruction is shown in
vstdi vVS,rA,rB,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The index (rB) is a signed value, and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
The “Vector Store Doublet Masked Index (vstdmi)” instruction is shown in
vstdmi vVS,rA,rB,mM,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address. The value in each element of vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contain the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The index (rB) is a signed value, and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
The “Vector Store Doublet Masked Offset (vstdmo)” instruction is shown in
vstdmo vVS,rA,O,mM,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The contents of general-purpose register rA are added to the offset to form the effective SRAM address. The value in each element of vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contain the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The offset (O) is a signed value, and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
The “Vector Store Doublet Offset (vstdo)” instruction is shown in
vstdo vVS,rA,O,P,gG
For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM. The 6-bit signed offset is sign-extended, shifted left six bit positions, and added to the contents of general-purpose register rA to form the effective SRAM address. The 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion. The G field is used to select one of eight local registers that contain the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file. The offset (O) is a signed value, and the base (rA) register is an unsigned value. A typical implementation of the instruction is:
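By way of illustration only, and not as the implementation referred to above, the address formation and masked store described for the store instructions may be modeled in a few lines of Python. The function names, the 32-bit register width, the flat byte-addressed SRAM, and the contiguous placement of elements (stride and skip are ignored) are all assumptions made for this sketch.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def ea_indexed(ra, rb, reg_bits=32):
    """Indexed form: unsigned base rA plus signed index rB.
    The 32-bit register width is an assumption of this sketch."""
    base = ra & ((1 << reg_bits) - 1)
    return base + sign_extend(rb, reg_bits)


def ea_scaled_offset(ra, offset6):
    """Offset form: 6-bit signed offset, sign-extended, shifted left
    six bit positions, then added to the unsigned base rA."""
    return ra + (sign_extend(offset6, 6) << 6)


def masked_store_bytes(sram, ea, elements, mask_bits):
    """Masked form: write an element only where its mask bit is 1.
    Elements are placed contiguously here (stride/skip not modeled)."""
    for i, (value, bit) in enumerate(zip(elements, mask_bits)):
        if bit == 1:
            sram[ea + i] = value & 0xFF
```

For example, under these assumptions a 6-bit offset field of 0b111111 (that is, -1) moves the effective address 64 bytes below the base, and an offset of 1 moves it 64 bytes above.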
The vector memory system is coupled to a scalar cache 310, also implemented as SRAM. The cache interfaces with the vector memory system over two buses, a 128 bit-wide cache line fill bus 312, and a 32 bit-wide quadlet store bus 314. The cache tags 316 are depicted. There are five external invalidate interface buses 318. Scalar cache 310 is a 4 k byte cache which is four-way set associative. It is a write-through cache with 16 byte lines. In
The data in the 128 banks of SRAM is loaded and unloaded using a double buffered DMA shift register 320. As will be discussed in more detail below, generally, the shift register is loaded and then its contents transferred out in parallel to a buffer. At an appropriate time during operation of the vector memory system, the 256 bytes are loaded into the 128 banks in parallel.
Based upon a control signal provided to it, discussed below, multiplexer 360 selects one of these three sets of input data and provides that set of inputs to the multiplexer 364. Multiplexer 364 enables the retry control, and will select the retry bus 360 if there has been a bank conflict or collision in the address information earlier provided, for example, if successive writes are to the same bank. If there has been no bank conflict, then the information from multiplexer 360 is placed on the bus 340 and provided to stage 0 (342) for determination about whether that bank address falls within the group of banks 0-31 in group 332.
The determination of the priority among the three sets of data provided to multiplexer 360 and multiplexer 364 is hardwired. First priority is always given to retrying information from a previous cycle when a bank conflict has occurred. Second priority is assigned to the DMA controller for reloading the banks of memory, as discussed with regard to
In the upper portion,
The “Multi-Pipe Vector Block Matching Instruction (mvbma)” instruction is shown in
mvbma vVD,vVA,vVB,gG
The mvbma instruction performs a full search block matching operation between the pixel data of a current image block, typically 8×8 pixels, stored in the vector registers vVB and a reference area of the image data, typically 15×15 pixels, stored in vector registers vVA and vVA+1. (Because there is not enough space in the instruction format, register vVA+1 is defined as the next register in the set and is utilized in this manner.)
Both the reference area and current block are stored in vector registers and packed as two pixels per vector register element, each expressed as an 8-bit unsigned value. For execution of the instruction, a fixed vector length of 15 is set in field gG[47:42], and the starting element must be zero. Other numbers produce undefined results. For this instruction, the selected G register file in each pipe must be identical. The reference image data is loaded from sixteen vector registers, vVA and vVA+1 from each of the four pipes. This instruction operates as a multi-pipe instruction. The results of the block matching operation for each block match are stored in registers vVD as described below.
In this instruction, the sum of absolute differences (SAD) between pixels is used as the block matching criterion. In this operation, the pixels of two images, the current block of pixels and the reference block of pixels, are compared one by one: the difference between each pair of pixel values, e.g. gray levels, is calculated, and the sum of the absolute values of all the differences is returned. Of course other comparison operations may also be used. In implementing the operation, a block comparison of an 8×8 pixel current block stored in register vVB with respect to a reference area of 15×15 pixels stored in vVA and vVA+1 is performed. After a comparison is made at index 0, the current block is shifted one pixel column to the right and a new comparison performed against the reference block at index 1 in the same manner as just described, i.e. for all 64 pixels of the current block. After this comparison, the current block is “moved,” and again compared to the reference block. This process of comparing and shifting is repeated until all of index locations 0-63 have SADs computed and stored in register vVD.
The general approach for determining matching of the current block to the reference block, as well as an index to identify the relative position of the current block with respect to the reference block, for various block comparison locations, is shown in
The operation just described is considered the result for one comparison. There are 64 locations to compare an 8×8 current block of pixels with the 15×15 pixel reference area, and thus there are 64 search locations. For each search location, the SAD of the current block with respect to the reference area at that location is computed and returned to vector registers VD0, VD1, VD2 and VD3.
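The full search just described can be sketched in software as follows. This is a minimal Python model, not the hardware data path: the function name is hypothetical, the images are plain nested lists of 8-bit values, and the ordering of the 64 results (index = 8 × row offset + column offset) is an assumption consistent with the current block being shifted one pixel column at a time.

```python
def full_search_sad(current, reference):
    """Return the SAD of `current` against every placement in `reference`.

    current:   8x8 list of lists of pixel values (0-255)
    reference: 15x15 list of lists of pixel values (0-255)
    Result index = row_offset * 8 + column_offset (assumed ordering).
    """
    block = len(current)                      # 8
    positions = len(reference) - block + 1    # 15 - 8 + 1 = 8 per axis
    sads = []
    for dy in range(positions):
        for dx in range(positions):
            total = 0
            for y in range(block):
                for x in range(block):
                    total += abs(reference[dy + y][dx + x] - current[y][x])
            sads.append(total)
    return sads
```

The best match is then the index holding the smallest value, for example `sads.index(min(sads))`.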
This instruction requires 15 clock periods to retrieve the reference and current block data from the vector registers. Storing of the results requires 16 clock periods, but cannot start until clock period 8, resulting in a total latency of 24 clocks. The final 8 clocks for storing, however, can be overlapped with the next instruction, yielding an average latency of 16 clock periods. With a reference size of 15×15, all 64 search locations, each involving 8×8 pixel differences, are computed in 24 clocks. At the average latency of 16 clocks this is ((8×8)×(8×8))/16=256 SADs per clock, counted per pixel difference, which results in 192 GigaSAD/sec per vector processor (256×750 MHz).
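The throughput figure can be checked with a short calculation. The sketch below simply restates the arithmetic of the preceding paragraph in Python; the constant and function names are chosen for illustration, and the count is per pixel difference, as in the text.

```python
BLOCK_PIXELS = 8 * 8          # pixels in the current block
SEARCH_LOCATIONS = 8 * 8      # placements within the 15x15 reference area
AVERAGE_CLOCKS = 16           # average latency with overlapped stores
CLOCK_HZ = 750_000_000        # 750 MHz clock stated in the text


def sad_throughput():
    """Return (pixel differences per clock, pixel differences per second)."""
    per_clock = (BLOCK_PIXELS * SEARCH_LOCATIONS) // AVERAGE_CLOCKS
    per_second = per_clock * CLOCK_HZ
    return per_clock, per_second
```

Evaluating `sad_throughput()` gives 256 per clock and 192 × 10^9 per second, matching the figures above.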
The second convolver performs the 64 pixel comparisons for each of the eight index locations 8-15; the third convolver for index locations 16-23, etc. Note that the clock periods for the operations are offset by one clock for each subsequent convolver, i.e. the convolvers operate on Clock0-7, Clock1-8, Clock2-9, Clock3-10, Clock4-11, Clock5-12, Clock6-13, Clock7-14. A series of 64 bit registers along the right side of
As shown in
Block0=pixel 0-7
Block1=pixel 1-8
Block2=pixel 2-9
Block3=pixel 3-10
Block4=pixel 4-11
Block5=pixel 5-12
Block6=pixel 6-13
Block7=pixel 7-14
Each block overlaps its neighbor by 7 pixels and is shifted to the right by one pixel, hence the convolution. Thus Block0 computes the SAD horizontally on 8 pixels starting with pixel 0, Block1 computes the SAD horizontally on 8 pixels starting with pixel 1, and so forth. The SAD calculations from each functional unit are then provided to corresponding adders SUM0, SUM1, . . . which compute the sums of the results of the SAD operations, ultimately providing those sums as output signals (to the FIFO shown in
The equations below describe how all of the inputs from the vector registers compute the SAD horizontally on eight bits. For example, one term of the sum of absolute differences is |(VA_P0[15:8])−(VB_P0[15:8])|; here the absolute value is taken of the difference between the vVA and vVB pixels.
8 SAD Arithmetic Units
SAD00[8:0]=|(VA_P0[15:8])−(VB_P0[15:8])|
SAD01[8:0]=|(VA_P0[7:0])−(VB_P0[7:0])|
SAD02[8:0]=|(VA+1_P0[15:8])−(VB_P1[15:8])|
SAD03[8:0]=|(VA+1_P0[7:0])−(VB_P1[7:0])|
SAD04[8:0]=|(VA_P1[15:8])−(VB_P2[15:8])|
SAD05[8:0]=|(VA_P1[7:0])−(VB_P2[7:0])|
SAD06[8:0]=|(VA+1_P1[15:8])−(VB_P3[15:8])|
SAD07[8:0]=|(VA+1_P1[7:0])−(VB_P3[7:0])|
Block0[11:0]=SAD00[8:0]+SAD01[8:0]+SAD02[8:0]+SAD03[8:0]+SAD04[8:0]+SAD05[8:0]+SAD06[8:0]+SAD07[8:0]
SAD11[8:0]=|(VA_P0[7:0])−(VB_P0[15:8])|
SAD12[8:0]=|(VA+1_P0[15:8])−(VB_P0[7:0])|
SAD13[8:0]=|(VA+1_P0[7:0])−(VB_P1[15:8])|
SAD14[8:0]=|(VA_P1[15:8])−(VB_P1[7:0])|
SAD15[8:0]=|(VA_P1[7:0])−(VB_P2[15:8])|
SAD16[8:0]=|(VA+1_P1[15:8])−(VB_P2[7:0])|
SAD17[8:0]=|(VA+1_P1[7:0])−(VB_P3[15:8])|
SAD18[8:0]=|(VA_P2[15:8])−(VB_P3[7:0])|
Block1[11:0]=SAD11[8:0]+SAD12[8:0]+SAD13[8:0]+SAD14[8:0]+SAD15[8:0]+SAD16[8:0]+SAD17[8:0]+SAD18[8:0]
SAD22[8:0]=|(VA+1_P0[15:8])−(VB_P0[15:8])|
SAD23[8:0]=|(VA+1_P0[7:0])−(VB_P0[7:0])|
SAD24[8:0]=|(VA_P1[15:8])−(VB_P1[15:8])|
SAD25[8:0]=|(VA_P1[7:0])−(VB_P1[7:0])|
SAD26[8:0]=|(VA+1_P1[15:8])−(VB_P2[15:8])|
SAD27[8:0]=|(VA+1_P1[7:0])−(VB_P2[7:0])|
SAD28[8:0]=|(VA_P2[15:8])−(VB_P3[15:8])|
SAD29[8:0]=|(VA_P2[7:0])−(VB_P3[7:0])|
Block2[11:0]=SAD22[8:0]+SAD23[8:0]+SAD24[8:0]+SAD25[8:0]+SAD26[8:0]+SAD27[8:0]+SAD28[8:0]+SAD29[8:0]
SAD33[8:0]=|(VA+1_P0[7:0])−(VB_P0[15:8])|
SAD34[8:0]=|(VA_P1[15:8])−(VB_P0[7:0])|
SAD35[8:0]=|(VA_P1[7:0])−(VB_P1[15:8])|
SAD36[8:0]=|(VA+1_P1[15:8])−(VB_P1[7:0])|
SAD37[8:0]=|(VA+1_P1[7:0])−(VB_P2[15:8])|
SAD38[8:0]=|(VA_P2[15:8])−(VB_P2[7:0])|
SAD39[8:0]=|(VA_P2[7:0])−(VB_P3[15:8])|
SAD310[8:0]=|(VA+1_P2[15:8])−(VB_P3[7:0])|
Block3[11:0]=SAD33[8:0]+SAD34[8:0]+SAD35[8:0]+SAD36[8:0]+SAD37[8:0]+SAD38[8:0]+SAD39[8:0]+SAD310[8:0]
SAD44[8:0]=|(VA_P1[15:8])−(VB_P0[15:8])|
SAD45[8:0]=|(VA_P1[7:0])−(VB_P0[7:0])|
SAD46[8:0]=|(VA+1_P1[15:8])−(VB_P1[15:8])|
SAD47[8:0]=|(VA+1_P1[7:0])−(VB_P1[7:0])|
SAD48[8:0]=|(VA_P2[15:8])−(VB_P2[15:8])|
SAD49[8:0]=|(VA_P2[7:0])−(VB_P2[7:0])|
SAD410[8:0]=|(VA+1_P2[15:8])−(VB_P3[15:8])|
SAD411[8:0]=|(VA+1_P2[7:0])−(VB_P3[7:0])|
Block4[11:0]=SAD44[8:0]+SAD45[8:0]+SAD46[8:0]+SAD47[8:0]+SAD48[8:0]+SAD49[8:0]+SAD410[8:0]+SAD411[8:0]
SAD55[8:0]=|(VA_P1[7:0])−(VB_P0[15:8])|
SAD56[8:0]=|(VA+1_P1[15:8])−(VB_P0[7:0])|
SAD57[8:0]=|(VA+1_P1[7:0])−(VB_P1[15:8])|
SAD58[8:0]=|(VA_P2[15:8])−(VB_P1[7:0])|
SAD59[8:0]=|(VA_P2[7:0])−(VB_P2[15:8])|
SAD510[8:0]=|(VA+1_P2[15:8])−(VB_P2[7:0])|
SAD511[8:0]=|(VA+1_P2[7:0])−(VB_P3[15:8])|
SAD512[8:0]=|(VA_P3[15:8])−(VB_P3[7:0])|
Block5[11:0]=SAD55[8:0]+SAD56[8:0]+SAD57[8:0]+SAD58[8:0]+SAD59[8:0]+SAD510[8:0]+SAD511[8:0]+SAD512[8:0]
SAD66[8:0]=|(VA+1_P1[15:8])−(VB_P0[15:8])|
SAD67[8:0]=|(VA+1_P1[7:0])−(VB_P0[7:0])|
SAD68[8:0]=|(VA_P2[15:8])−(VB_P1[15:8])|
SAD69[8:0]=|(VA_P2[7:0])−(VB_P1[7:0])|
SAD610[8:0]=|(VA+1_P2[15:8])−(VB_P2[15:8])|
SAD611[8:0]=|(VA+1_P2[7:0])−(VB_P2[7:0])|
SAD612[8:0]=|(VA_P3[15:8])−(VB_P3[15:8])|
SAD613[8:0]=|(VA_P3[7:0])−(VB_P3[7:0])|
Block6[11:0]=SAD66[8:0]+SAD67[8:0]+SAD68[8:0]+SAD69[8:0]+SAD610[8:0]+SAD611[8:0]+SAD612[8:0]+SAD613[8:0]
SAD77[8:0]=|(VA+1_P1[7:0])−(VB_P0[15:8])|
SAD78[8:0]=|(VA_P2[15:8])−(VB_P0[7:0])|
SAD79[8:0]=|(VA_P2[7:0])−(VB_P1[15:8])|
SAD710[8:0]=|(VA+1_P2[15:8])−(VB_P1[7:0])|
SAD711[8:0]=|(VA+1_P2[7:0])−(VB_P2[15:8])|
SAD712[8:0]=|(VA_P3[15:8])−(VB_P2[7:0])|
SAD713[8:0]=|(VA_P3[7:0])−(VB_P3[15:8])|
SAD714[8:0]=|(VA+1_P3[15:8])−(VB_P3[7:0])|
Block7[11:0]=SAD77[8:0]+SAD78[8:0]+SAD79[8:0]+SAD710[8:0]+SAD711[8:0]+SAD712[8:0]+SAD713[8:0]+SAD714[8:0]
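The pattern in the equations above can be summarized in a short Python sketch. It assumes, as the equations suggest, that one reference row is formed by interleaving the two pixels of the vVA element and the two pixels of the vVA+1 element from pipes P0 through P3, and that one current-block row is formed from the vVB elements of the same four pipes. The function names are hypothetical, and the sketch models the arithmetic only, not the clocking of the convolvers.

```python
def unpack(word):
    """Split a 16-bit element into its two 8-bit pixels: [15:8], then [7:0]."""
    return [(word >> 8) & 0xFF, word & 0xFF]


def reference_row(va, va1):
    """va[p] and va1[p] are the 16-bit vVA and vVA+1 elements from pipe p.
    Returns 16 pixels; the equations above use only pixels 0-14."""
    row = []
    for p in range(4):
        row += unpack(va[p]) + unpack(va1[p])
    return row


def current_row(vb):
    """vb[p] is the 16-bit vVB element from pipe p; returns 8 pixels."""
    row = []
    for p in range(4):
        row += unpack(vb[p])
    return row


def block_sads(va, va1, vb):
    """Block k sums |ref[k + i] - cur[i]| for i = 0..7, for k = 0..7."""
    ref = reference_row(va, va1)
    cur = current_row(vb)
    return [sum(abs(ref[k + i] - cur[i]) for i in range(8)) for k in range(8)]
```

With reference pixels 0 through 15 and current pixels 0 through 7, every pixel pair in Block k differs by k, so the eight block sums are 0, 8, 16, and so on up to 56.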
Another instruction for the vector processor is described next.
The “Convolution FIR Filter (cfirf)” instruction is shown in
cfirf vVD,vVA,vVB,S,R,P,gG,Y
This format defines three convolution finite impulse response (FIR) filter instructions. The format allows the selection of a 4, 5 or 6 tap filter to be performed on the vVA register by the Y field bits [1:0]. Each of the instructions performs a convolution FIR filter with data in the vVA vector register and up to six 8-bit signed coefficients, stored in the vVB vector register. Each coefficient is loaded into bits [7:0] of the vector register, with coefficient 0 in element 0 and coefficient 5 in element 5.
The vector register specified by the vVA field has one 16-bit signed pixel in each element of the register. There are six MAC units in this functional unit and each MAC unit is shown in
The adder in each of the filters can perform rounding and saturating adds as a function of the R bits [9:8] of the immediate field. The saturating add forces all “ones” when an overflow occurs on a positive number. If the result of the adder is a negative number, the adder is forced to all “zeros”. The final result can be shifted in accordance with the immediate field S [13:10] controls.
Bits [16:1] of the shift and round unit are selected and transferred to the register vVD as shown in Table 6. Table 5 shows which MAC unit is operating on specific elements of the vVA register. For example, for a 6 tap filter, MAC unit 0 operates on doublet [15:0] of elements 0, 1, 2, 3, 4, and 5 in the vVA register and produces one 16-bit result. MAC unit 0 then operates on elements 6, 7, 8, 9, 10, and 11, and produces another result. Selecting a 4 tap filter allows 28 filters in 31 clocks, while a 5 tap filter will allow 25 filters in 29 clocks. A 6 tap filter allows 24 filters in 29 clocks. The results of a 6 tap filter are placed in the vVD vector register as shown in Table 6; other filters have similar repeating output characteristics. The vector pipe is selected by the 3-bit P field. The G field selects the register containing the starting element, which must be zero, and the vector length as specified in Table 5.
Number of taps=Y[1:0] (16-bit signed input and output)
0x0=4 taps,
0x1=5 taps,
0x2=6 taps,
0x3=6 taps, used for 16x16 Macroblock
Shift count=(Arithmetic Right Shift) S[13:10]
0x0=no shift
0x1=1, 0x2=2, 0x3=3, 0x4=4, 0x5=5, 0x6=6, 0x7=7, 0x8=8
0x9=9, 0xA=10, 0xB=11, 0xC=12, 0xD=13, 0xE=14, 0xF=15
Round=R[9:8]
0x0=no round
0x1=round and no saturation
0x2=round with 8-bit saturation
0x3=round with 16-bit saturation
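The field encodings listed above can be expressed as a small decoder. The following Python sketch is illustrative only: the function name and the returned dictionary are hypothetical, the S and R fields are taken from bits [13:10] and [9:8] of a single immediate word as stated in the text, and the Y field is passed separately because its position in the instruction word is not given here.

```python
TAPS = {0x0: 4, 0x1: 5, 0x2: 6, 0x3: 6}
ROUND_MODES = {
    0x0: "no round",
    0x1: "round, no saturation",
    0x2: "round, 8-bit saturation",
    0x3: "round, 16-bit saturation",
}


def decode_fir_controls(imm, y):
    """Decode the shift count S[13:10], round mode R[9:8] and tap select Y[1:0]."""
    shift = (imm >> 10) & 0xF     # arithmetic right shift count, 0-15
    rnd = (imm >> 8) & 0x3        # rounding / saturation mode
    y &= 0x3
    return {
        "taps": TAPS[y],
        "macroblock_16x16": y == 0x3,   # 0x3 selects 6 taps for a 16x16 macroblock
        "shift": shift,
        "round": ROUND_MODES[rnd],
    }
```

For instance, an immediate with S=5 and R=3 together with Y=2 decodes to a 6 tap filter, a shift of 5, and rounding with 16-bit saturation.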
A typical implementation of the instruction (for shifting and rounding of MAC units) is:
The “Multi-Pipe Convolution FIR Filter (mcfirf)” instruction is shown in
mcfirf vVD,vVA,vVB,S,R,gG,Y
Like the cfirf instruction, this format defines three convolution FIR filter instructions. The format allows the selection of a 4, 5 or 6 tap filter to be performed on the vVA register by the Y field bits [1:0]. Each of the instructions performs a convolution FIR filter with data in the vVA vector register and up to six 8-bit signed coefficients, stored in the vVB vector register. Each coefficient is loaded into bits [7:0] of the vector register, with coefficient 0 in element 0 and coefficient 5 in element 5.
The vector register specified by the vVA field has one 16-bit signed pixel in each element of the register. There are six MAC units in this functional unit and each MAC unit is shown in
The adder in each of the filters can perform rounding and saturating adds as a function of the R bits [9:8] of the immediate field. The saturating add forces all “ones” when an overflow occurs on a positive number. If the result of the adder is a negative number, the adder is forced to all “zeros”. The final result can be shifted in accordance with the immediate field S [13:10] controls.
Bits [16:1] of the shift and round unit are selected and transferred to the register vVD as shown in Table 6. Table 5 shows which MAC unit is operating on specific elements of the vVA register. For example, for a 6 tap filter, MAC unit 0 operates on doublet [15:0] of elements 0, 1, 2, 3, 4, and 5 in the vVA register and produces one 16-bit result. MAC unit 0 then operates on elements 6, 7, 8, 9, 10, and 11, and produces another result. Selecting a 4 tap filter allows 28 filters in 31 clocks, while a 5 tap filter will allow 25 filters in 29 clocks. A 6 tap filter allows 24 filters in 29 clocks. The results of a 6 tap filter are placed in the vVD vector register as shown in Table 6; other filters have similar repeating output characteristics.
This is a multi-pipe instruction. The G field selects the register containing the starting element, which must be zero, and the vector length as specified in Table 5.
Number of taps=Y[1:0] (16-bit signed input and output)
0x0=4 taps,
0x1=5 taps,
0x2=6 taps,
0x3=6 taps, used for 16x16 Macroblock
Shift count=(Arithmetic Right Shift) S[13:10]
0x0=no shift
0x1=1, 0x2=2, 0x3=3, 0x4=4, 0x5=5, 0x6=6, 0x7=7, 0x8=8
0x9=9, 0xA=10, 0xB=11, 0xC=12, 0xD=13, 0xE=14, 0xF=15
Round=R[9:8]
0x0=no round
0x1=round and no saturation
0x2=round with 8-bit saturation
0x3=round with 16-bit saturation
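The multiply-accumulate, shift, round and saturate steps described above can be modeled roughly as follows. This Python sketch rests on several assumptions that the text does not confirm: each output is taken as a sliding window of consecutive input elements, rounding is modeled as adding half of the shifted-out weight before the arithmetic right shift, and saturation clamps to all ones on positive overflow and to zero for negative results, as the description of the saturating add suggests. All function names are hypothetical.

```python
def fir_window(samples, coeffs, start):
    """Multiply-accumulate one window: sum of coeffs[j] * samples[start + j]."""
    return sum(c * samples[start + j] for j, c in enumerate(coeffs))


def shift_round_saturate(acc, shift, round_on, sat_bits=None):
    """Optional round (assumed add-half), arithmetic right shift, optional clamp."""
    if round_on and shift > 0:
        acc += 1 << (shift - 1)
    acc >>= shift                       # Python's >> is arithmetic on ints
    if sat_bits is not None:
        acc = max(0, min(acc, (1 << sat_bits) - 1))
    return acc


def fir(samples, coeffs, shift=0, round_on=False, sat_bits=None):
    """Apply the filter at every window position that fits in `samples`."""
    count = len(samples) - len(coeffs) + 1
    return [
        shift_round_saturate(fir_window(samples, coeffs, n), shift, round_on, sat_bits)
        for n in range(count)
    ]
```

A 4, 5 or 6 tap filter is obtained simply by passing a coefficient list of that length.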
A typical implementation of the instruction (for shifting and rounding of MAC units) is:
The “Vector Add & Shift Right Arithmetic & Round Convolution FIR Filter (vaddsrar)” instruction is shown in
vaddsrar vVD,vVA,vVB,C,I,P,gG
The vector pipe is selected by the 3-bit P field. The arithmetic functional unit is selected by the hardware. Each element of the vector register specified by the vVA field is added to the corresponding element of vector register vVB. The sum is shifted right arithmetically, with the sign bit replicated into the vacated high-order bits and remaining in bit [15], and the result is placed in the vVD vector register. The shift count is controlled by the count in the immediate field I[12:9]. If the C[13] field bit is a “one” and the sum is positive, a plus one is added to the LSB−1. If the C[13] field bit is a “one” and the sum is negative, a minus one is added to the LSB−1. If C[13] is equal to “zero” or the shift count is “zero”, no rounding takes place. The G field selects the register containing the starting element and vector length.
A typical implementation is:
[(K[16])?−C[13]:+C[13] means that if the value of K bit 16 is true, add minus C bit 13, if K bit 16 is false, add plus C bit 13 to K[16:0]. Thus, this is either adding one bit or not to temporary register K[16].]
The preceding has been a description of a preferred embodiment of a vector processor with special purpose register and a high speed memory access system. Although numerous details have been provided for the purpose of explaining the system, the scope of the invention is defined by the appended claims.
This application is a continuation of U.S. application Ser. No. 11/656,143, filed Jan. 19, 2007, which was a continuation-in-part of U.S. application Ser. No. 11/126,522, filed May 10, 2005, entitled “Vector Processor with Special Purpose Registers and High Speed Memory Access,” the entire disclosure of which is incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11656143 | Jan 2007 | US |
Child | 11927352 | Oct 2007 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11126522 | May 2005 | US |
Child | 11656143 | Jan 2007 | US |