1. Field of the Invention
The invention relates in general to a data processor, and more particularly to an integrated data processor, which integrates a plurality of functions of a digital signal processor (DSP) and a microprocessor control unit (MCU).
2. Description of the Related Art
In conventional systems that use a digital signal processor (DSP), most architectures pair an independent microprocessor control unit (MCU) with an independent digital signal processor that operates as a co-processor. The DSP is generally used as a co-processor to assist the MCU in data processing. While the MCU sends out a DSP instruction to control the data operation of the DSP, the MCU simultaneously executes its own instructions.
In terms of the tasks of the MCU and the DSP, the MCU usually works as a controller, for example processing interrupts and receiving bit-stream data. The received data can then be transferred to the DSP for further operations.
However, it is found that the hardware cost of the above-mentioned two independent processors is rather high. Each independent processor further includes respective built-in detailed units, such as an instruction decoder, an operand fetch unit, a calculation unit, and a storage unit. An interface of a data transfer/communication channel between the MCU and the DSP is also required. Therefore, in view of the hardware architecture of the conventional system, a bottleneck exists in reducing the hardware cost.
The present invention provides an integrated data processor that supports both MCU and DSP functions. The system architecture of the integrated data processor ensures high efficiency in data operations and also reduces hardware cost.
To achieve an objective of the present invention, the integrated data processor includes an arithmetic unit, an advanced memory parallelism bus (AMPB), and a Y address generator.
The arithmetic unit works as a core unit for performing data calculation. The arithmetic unit is connected to a common data bus and a Y data bus. The common data bus is connected to an X address generator, a data fetch unit and a register unit. Moreover, the Y data bus is connected to an internal Y RAM.
The advanced memory parallelism bus (AMPB) is connected to the data fetch unit and an instruction fetch unit. The AMPB is also connected to an internal program ROM/FLASH, an internal X RAM, an external ROM/RAM and a plurality of peripheral devices. The AMPB can operate in a pipelined manner to transmit data and fetch an instruction at the same time, which enhances parallel operation. In addition, the AMPB further includes a data transfer unit and an interrupt controller. The interrupt controller handles operations when an interrupt is requested.
The Y address generator is connected to the register unit and the internal Y RAM.
With the above-mentioned architecture of the present invention, the processor cooperating with a novel instruction set can efficiently execute data computation and instruction operation.
An innovative integrated data processor is provided, which integrates a plurality of functions of a digital signal processor (DSP) and a microprocessor control unit (MCU). A plurality of instructions and a pipelined architecture are applied in the present invention, so that a single instruction can be executed in a single cycle. An operand can be fetched from a RAM, and a calculation result can be written back to a RAM, so as to greatly enhance the operation efficiency of the whole system.
Referring to
The arithmetic unit 10 is connected by a common data bus 11 to an X address generator 20, a data fetch unit 30 and a register unit 40. The arithmetic unit 10 is further connected to an internal YRAM 15 by a Y data bus 13.
An advanced memory parallelism bus (AMPB) 50 includes a data transfer unit 51 and an interrupt controller 52. The AMPB 50 is connected to the data fetch unit 30 and an instruction fetch unit 60. The AMPB 50 can access data via an internal program ROM/FLASH 71, an internal XRAM 72, an external ROM/RAM 73 and a plurality of peripheral devices 74.
A Y address generator 22 is connected to the register unit 40 and the internal YRAM 15.
The instruction fetch unit 60 can fetch instructions from the internal program ROM/FLASH 71, the internal XRAM 72, and the external ROM/RAM 73 via the advanced memory parallelism bus 50. Simultaneously, the data fetch unit 30 can fetch operand data from any of the RAMs via the advanced memory parallelism bus 50. Hence, when the instructions and data are fetched, the advanced memory parallelism bus 50 controls the data access path, determines access priority, and switches between accesses. An important feature of the advanced memory parallelism bus 50 is its ability to fetch the instructions and the data simultaneously, which enhances parallel operation of the processor.
The instruction fetch unit 60 is further connected with an instruction decoder/control unit 62. The instruction decoder/control unit 62 decodes a coded instruction fetched by the instruction fetch unit 60 and generates a pipeline control instruction.
The present invention uses a set of address RAMs, which are the internal XRAM 72 and the internal YRAM 15. The arithmetic unit 10 can fetch two operands, one from the internal XRAM 72 and one from the internal YRAM 15, within a single cycle for a multiply-and-accumulate (MAC) calculation. A MAC operation reads one operand from XRAM and another operand from YRAM in parallel, multiplies them, and accumulates the product into an AR (accumulator) register.
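As a rough behavioral model of the MAC step described above, the following Python sketch reads one operand from each RAM and accumulates the product. It is not the processor's implementation: the signed 16-bit operand width, the wrap-around of the accumulator at 40 bits, the dictionary model of the RAMs, and the names `mac_step`, `xram` and `yram` are all assumptions made for illustration.

```python
ACC_BITS = 40                      # accumulator width (ARX:ARH:ARL), assumed 40 bits
ACC_MASK = (1 << ACC_BITS) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def mac_step(acc, xram, yram, x_addr, y_addr):
    """One multiply-and-accumulate: acc += XRAM[x_addr] * YRAM[y_addr].

    xram and yram are hypothetical dicts mapping an address to a 16-bit word.
    The result is returned as a raw 40-bit pattern (assumed to wrap).
    """
    x = to_signed(xram[x_addr], 16)
    y = to_signed(yram[y_addr], 16)
    return (to_signed(acc, ACC_BITS) + x * y) & ACC_MASK


# Two steps: 3 * 4, then (-1) * 2.
xram = {0: 0x0003, 2: 0xFFFF}
yram = {0: 0x0004, 2: 0x0002}
acc = mac_step(0, xram, yram, 0, 0)
acc = mac_step(acc, xram, yram, 2, 2)
```

Both operands are looked up before the multiply, which mirrors the parallel X/Y fetch; in hardware the two reads would occur in the same cycle.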
The X address generator 20 and the Y address generator 22 can generate two addresses simultaneously, so as to provide for the MAC calculation. Moreover, the single X address generator 20 can provide an address instruction for a general MCU operation, which is a 24-bit address for addressing any of the above-mentioned RAMs. This addressing manner for the RAMs of the present invention also can be applied for addressing a register and special function registers.
The X address generator 20 can execute two special functions: one is a circular buffer function, which is very helpful for DSP algorithms, and the other is a bit reversal function. The Y address generator 22 also includes the circular buffer function. However, only the X address generator 20 provides the bit reversal function.
The register unit 40 includes a plurality of general-purpose registers R0˜R4, a plurality of accumulator registers (AR) ARX, ARH and ARL, a plurality of index registers X0˜X2 and Y0˜Y2, a frame pointer and a stack pointer.
The foregoing description provides a brief illustration of the present invention. A detailed introduction of each part of the present invention is as follows.
First: Memory:
Referring to
Program addressing: the present invention can execute programs in the internal program ROM/FLASH 71, the internal XRAM 72, and the external ROM/RAM 73, but cannot execute programs in the internal YRAM 15. Addressing spaces for program codes and data are the same, which are 24-bit addresses. Referring to
Data addressing: referring to
Second: Registers:
Referring to
The aforesaid three accumulator registers ARX, ARH and ARL can be used together as a 40-bit accumulator in multiply and accumulate (MAC) instructions. Moreover, when the accumulator registers are not used for the MAC instructions, they can also be used as general-purpose registers, with ARX, ARH and ARL mapping to R5, R6, and R7.
An initial value of the stack pointer SP is the last word address in the internal XRAM 72 when the size of the internal XRAM 72 does not exceed 4 K bytes. For example, if the internal XRAM 72 is 2 K bytes and its last word address is 07FE, the address 07FE is the initial value of the stack pointer SP.
Furthermore, the frame pointer FP is used in a C compiler to allocate a designated address for local variables, so as to speed up the function call and return performance.
Moreover, the present invention also provides several special function registers, which include a system option control register (SOCR), a program status register (PSR), and a stack overflow/underflow register (STOVUN).
1. The system option control register (SOCR): referring to
STKCHK: set this bit to automatically check the stack pointer overflow/underflow.
RAM: set an initial address of 0x0000 for interrupt/trap vectors.
FR: used to set a fraction operation for MUL (multiplication) and MAC instructions. If the FR bit is set, a result of the multiplication operation will be shifted to the left by one bit.
MAS: if the MAS bit is set, a saturation mode will start automatically. When the accumulator is in the saturation mode and also a 32-bit overflow occurs, the accumulator will hold a maximum negative value of FF80000000 or a maximum positive value of 007FFFFFFF according to an overflow direction.
NSEG: the NSEG bit can be set to restrict the program code to be smaller than 64 K bytes.
WS: this bit is used to set a wait-state number of the external ROM/RAM.
DW (Disable Watch Dog Timer): if this bit is set, the watchdog timer is disabled.
UP: used to remove register write protection. Some registers are write-protected to prevent accidental writes. To change a write-protected register, the write protection has to be removed first.
IE: an Interrupt Enable bit.
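The effect of the FR and MAS bits can be sketched in Python as follows. This is an illustrative model only: the function names, the treatment of the accumulator as a raw 40-bit pattern, and the wrap-around behavior when MAS is clear are assumptions; the saturation constants FF80000000 and 007FFFFFFF are those given in the MAS description.

```python
ACC_BITS = 40
ACC_MASK = (1 << ACC_BITS) - 1
MAX_POS = 0x007FFFFFFF   # saturated positive value from the MAS description
MAX_NEG = 0xFF80000000   # saturated negative value (-2**31 as a 40-bit pattern)


def to_signed40(value):
    """Interpret a raw 40-bit pattern as a signed integer."""
    value &= ACC_MASK
    return value - (1 << ACC_BITS) if value & (1 << (ACC_BITS - 1)) else value


def multiply(x, y, fr):
    """Signed multiply; with FR set the product is shifted left by one bit."""
    product = x * y
    return product << 1 if fr else product


def accumulate(acc, product, mas):
    """Add product to the 40-bit accumulator pattern acc.

    With MAS set, a result outside the signed 32-bit range is clamped.
    With MAS clear, the result is assumed to wrap at 40 bits.
    """
    total = to_signed40(acc) + product
    if mas:
        if total > 0x7FFFFFFF:
            return MAX_POS
        if total < -0x80000000:
            return MAX_NEG
    return total & ACC_MASK
```

With FR set, multiplying two Q15 values such as 0x4000 (0.5) by 0x4000 yields 0x20000000, which is 0.25 in Q31; this is the usual reason for the one-bit left shift in fractional arithmetic.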
2. Program status register (PSR): referring to
Z: represents a zero flag.
V: represents an overflow flag.
C: represents a carry flag.
N: represents a negative flag.
MV: represents a MAC (Multiply and Accumulate) overflow flag, which indicates that the result of a MAC operation has exceeded 40 bits.
MS: an MAC saturation flag, which indicates the saturation in the MAC operation.
CPRI: priority information of the current process.
CIRQ: this bit is automatically set to 1 by hardware when an interrupt service routine or exception handling routine is entered. If CIRQ is 1, another interrupt request is allowed only when the PRI of that interrupt is larger than CPRI. If CIRQ is 0, another interrupt request is allowed when the PRI of that interrupt is equal to or larger than CPRI.
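The CIRQ rule above reduces to a single comparison whose strictness depends on CIRQ. A minimal Python sketch, with the hypothetical function name `interrupt_allowed`, is:

```python
def interrupt_allowed(pri, cpri, cirq):
    """Return True if a request with priority pri may interrupt the current process.

    pri  : priority of the requesting interrupt
    cpri : CPRI field of the PSR (priority of the current process)
    cirq : CIRQ bit of the PSR (1 while inside a service or exception routine)
    """
    if cirq:
        return pri > cpri      # inside a routine: strictly higher priority only
    return pri >= cpri         # otherwise equal priority is also allowed
```

For example, a priority-2 request is accepted over a priority-2 process only while CIRQ is 0.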
3. Stack overflow/underflow register (STOVUN): referring to
The processor of the present invention allows execution of a stack operation in any location of the internal ROM/RAM by changing the stack pointer SP to designate a stack address.
The stack overflow/underflow register includes two 8-bit registers: STKOV and STKUN. The stack overflow/underflow register can address up to 4 K bytes in units of 16 bytes, which means that the minimum stack capacity is 16 bytes. The upper limit of the stack is STKUN*16, and the lower limit of the stack is STKOV*16. The initial value of the stack pointer can be set to STKUN*16. A stack underflow occurs when the stack pointer is higher than STKUN*16. Conversely, a stack overflow occurs when the stack pointer is lower than STKOV*16.
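The limit check described above can be sketched as follows. The function name `check_stack` and the string return values are hypothetical; the comparisons follow the stated limits STKUN*16 and STKOV*16.

```python
def check_stack(sp, stkov, stkun):
    """Classify the stack pointer against the STKOV/STKUN limits.

    stkov, stkun : 8-bit register values; each unit represents 16 bytes.
    Returns "underflow", "overflow" or "ok".
    """
    upper = stkun * 16   # stack grows downward from this address
    lower = stkov * 16
    if sp > upper:
        return "underflow"
    if sp < lower:
        return "overflow"
    return "ok"
```

With STKUN = 0x80 and STKOV = 0x40, for instance, the stack may occupy addresses 0x0400 through 0x0800.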
Third: MAC Unit and Address Generation Unit (AGU):
MAC Unit: referring to
Address Generation Unit (AGU): as shown in the
Both the X address generator 20 and the Y address generator 22 support three addressing modes: a linear addressing mode, a circular buffer addressing mode, and a bit-reversal addressing mode. Referring to
There are three sets of addressing registers for both of the X address generator 20 and the Y address generator 22. [X0, XM0, XC0], [X1, XM1, XC1] and [X2, XM2, XC2] are the three sets of the addressing registers in the X address generator 20. [Y0, YM0, YC0], [Y1, YM1, YC1] and [Y2, YM2, YC2] are the three sets of the addressing registers in the Y address generator 22. These addressing registers support the above-mentioned three addressing modes, and these addressing modes can be distinguished by the XMn or YMn register. Referring to
1. Linear Addressing:
The linear addressing is the normal addressing mode supporting the MAC instruction. For example,
This instruction multiplies and accumulates two linear array elements each pointed by X0 and Y0. After the multiply-accumulate operation, both X0 and Y0 are incremented by 2. If it is wished to apply this operation to all elements (assume the array size is 256) of these two linear arrays, the following codes can be written:
Some DSP algorithms (for example, FIR algorithm) have fixed coefficients and moving data. After each multiply-accumulate operation, the current data overwrites the previous data. The present invention assumes the data is pointed by the X address generator 20 and the coefficient is pointed by the Y address generator 22. For example,
This instruction first fetches the data pointed to by X0 and the coefficient pointed to by Y0. After multiplying these two elements and accumulating the result into the accumulator, the processor of the present invention retains the [X0] data and writes it to the address (X0−1).
2. Circular Buffer Addressing:
The circular buffer addressing is used to speed up some DSP algorithms with repeated MAC operations. For example, if it is desired to declare a 16-word circular buffer, the following instruction can be used:
Label0: .CIRCBUF 0x10.
The instruction defines a 16-word circular buffer and allocates 16 words (32 bytes) in RAM. The base address of the circular buffer is automatically allocated to an address of k*(2^n), where 2^n >= 0x20 and k is any integer; for this buffer the base address is therefore of the form k*2^5. The upper bound of the buffer is k*(2^n)+0x20−1.
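The alignment rule can be illustrated with a short Python sketch. It assumes that the allocator picks the smallest power of two that is not less than the buffer size in bytes, which the text does not state explicitly; the function names are hypothetical.

```python
def circ_alignment(size_bytes):
    """Smallest power of two >= size_bytes (assumed alignment of the base)."""
    n = 0
    while (1 << n) < size_bytes:
        n += 1
    return 1 << n


def circ_bounds(base, size_bytes):
    """Return (lower, upper) byte addresses of a circular buffer at base.

    Raises ValueError if base is not a multiple of the required alignment.
    """
    align = circ_alignment(size_bytes)
    if base % align != 0:
        raise ValueError("base address must be a multiple of 0x%X" % align)
    return base, base + size_bytes - 1
```

For the 16-word buffer above, `circ_alignment(0x20)` is 0x20, so a base of 0x40 gives bounds 0x40 through 0x5F.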
For example, if the start address is the 5th word in the buffer, the following instruction is used to perform MAC operations on the circular buffer Label0.
MOV X0, #Label0+10; (5 word*2=10 bytes).
MOV XM0, #0x20; the buffer length is 16 words (0x20 bytes).
REP #0x2D; repeat the next instruction 0x2E times.
MAC.uu [X0++], [Y0++]
This code performs 0x2E MAC operations on the circular buffer Label0, starting from the address #Label0+10. After each operation, X0 is automatically incremented by 2. When X0>=#Label0+0x20, X0 wraps around to #Label0+(X0−(#Label0+0x20)).
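The wrap-around rule can be traced with a small Python model. The function name `circ_post_increment` and the example base address 0x100 are hypothetical; the wrap expression is the one given above.

```python
def circ_post_increment(x0, base, length, step=2):
    """Post-increment x0 by step bytes inside a circular buffer.

    base   : buffer start address (#Label0)
    length : buffer length in bytes (the XM0 value, 0x20 in the example)
    """
    x0 += step
    if x0 >= base + length:
        x0 = base + (x0 - (base + length))
    return x0


# Trace the 0x2E accesses of the example, starting at Label0 + 10.
base = 0x100                      # hypothetical address chosen for Label0
x0 = base + 10
offsets = []
for _ in range(0x2E):
    offsets.append(x0 - base)     # offset used by this MAC operation
    x0 = circ_post_increment(x0, base, 0x20)
```

The recorded offsets run 10, 12, ..., 30 and then continue from 0, so all 0x2E accesses stay inside the 32-byte buffer.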
3. Bit Reversal Addressing:
The bit-reversal addressing logic is mainly used in FFT algorithms. This mode is available only for addresses generated from the three sets of addressing registers of the AGUs, and only when the value of XMn or YMn is 0xFFFF.
The bit-reversed address is derived by reversing the bit order of an address. For example, if the address of a 32-word buffer has the form k9k8k7k6k5k4k3k2k1k0b5b4b3b2b1b0, the bit-reversed address has the form k9k8k7k6k5k4k3k2k1k0b0b1b2b3b4b5. Note that the order of the six least significant bits is reversed.
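A software model of this transformation is shown below as a sketch. The function name `bit_reverse_address` is hypothetical; it reverses only the low `bits` bits and leaves the upper (k) bits unchanged, as in the example above.

```python
def bit_reverse_address(addr, bits):
    """Reverse the order of the low `bits` bits of addr; keep the upper bits."""
    mask = (1 << bits) - 1
    low = addr & mask
    reversed_low = 0
    for i in range(bits):
        # Move bit i of the low field to position bits-1-i.
        reversed_low = (reversed_low << 1) | ((low >> i) & 1)
    return (addr & ~mask) | reversed_low
```

For a 32-word (64-byte) buffer, `bits` is 6, so the address 0x41 (k field 1, low field 000001) maps to 0x60 (k field 1, low field 100000).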
Fourth: Barrel Shifter:
Referring to
Fifth: Interrupt and Exception Handling:
There are three types of event sources that make the processor of the present invention suspend current execution and branch to a service routine. The first event source is called an “interrupt”, which is generated by the peripherals, e.g. timers, I/O ports, serial interfaces, A/D converters, etc. The second event source is an exception, which is generated during program execution. An exception is an event that the programmer may not have anticipated when writing the program, for example an invalid instruction, an invalid address, a stack overflow, or a division by zero. The third event source is written explicitly in a program in the form of an instruction. Users can use “Trap” instructions to generate a software interrupt, which is processed in the same manner as a hardware interrupt. Users may also set a bit in the Interrupt Control Register (ICR) to make an interrupt request in the same way that hardware does.
The processor of the present invention can support up to 32 interrupt sources. There are three sets of registers to control the interrupt behavior. Interrupt mask registers are used to enable/disable interrupts. Interrupt pending registers are used to indicate the request status of interrupts. Interrupt level registers are used to prioritize the interrupts.
Referring to
EN: Interrupt Enable. Set this bit enabling the interrupt request to be processed.
RQ: Interrupt Request. This bit indicates the respective interrupt request has occurred and is pending. The bit is set by hardware if the interrupt occurs and will be cleared automatically by hardware when entering the respective interrupt service routine or the interrupt is processed by the data transfer unit 51. These two registers can be read or written by software.
ED: Enable DTU Processing. This bit enables the DTU 51 to process the interrupt request and transfer the data pointed by SRCPx to destination address pointed by DSTPx. If the ED bit is set, the PRI will represent the DTU channel used by this interrupt.
PRI: Interrupt Priority. The present invention supports four levels of interrupt priority. The value of PRI is 0˜3. A larger number represents a higher priority. When an interrupt request occurs, the present invention compares the PRI with CPRI in the PSR register. If PRI is higher than CPRI, the interrupt request is accepted and the current process is suspended. If PRI is not higher than CPRI, the interrupt request remains pending and its request bit stays set. If several interrupt requests arrive in the same cycle, their priorities are compared and the highest-priority interrupt is serviced.
Exception: the exception is generated while executing a program and some events occur. The exception handling mechanism can help programmers to create more robust program codes and can help to debug the program.
STU: Stack Underflow.
STO: Stack Overflow.
IIO: Invalid instruction or error format of operands.
IWA: Invalid word access address, which fetches the operands from the invalid word address.
IAE: Invalid Address Error, which fetches an undefined address.
IIA: Invalid Instruction Address.
IREP: Illegal Repeated Instruction.
DB0: Divide By Zero.
The exception operation is similar to the interrupt operation, but there are some differences. First, when an exception happens, the address of the current instruction (the instruction in the decode stage) is pushed onto the stack. This differs from the interrupt operation, which pushes the address of the next instruction (the instruction not yet decoded) onto the stack. Second, before the exception routine is entered, the CIRQ bit in the PSR register is set to 1 and the CPRI is set to 11 (the highest priority). The interrupt operation, in contrast, copies the PRI bits from the ICR to the PSR. Exceptions therefore have higher priority than all interrupts. The priority between different exceptions depends on the exception vector. A lower vector value represents a higher exception priority.
Sixth: Data Transfer Unit (DTU):
The data transfer unit 51 is included in the advanced memory parallelism bus (AMPB) 50. The data transfer unit 51 is capable of catching an interrupt and transferring a word or a byte from a preset source memory address to a preset destination memory address in one cycle. The data transfer unit 51 can automatically increment a source address pointer or destination address pointer after transferring the data.
There are 4 DTU channels in the DTU 51 which means at most 4 different interrupts can be assigned to the DTU 51. When an interrupt is assigned to the DTU 51 and the DTU channel's count is not zero, the interrupt service routine will not be activated. The interrupt request will cause the DTU channel n to get a word or a byte data located in the address stored in SRCPn. Then the DTU channel n stores that data into the address stored in DSTPn.
Referring to
To enable the DTU 51, the ED flag in the interrupt control register xxICR should be set and the PRI field in xxICR should be set to the channel number of the DTU 51. The fields of the DTU control register DTUCx are described as follows.
COUNT: counts DTU transfers.
WBT: Word/Byte transfer selection. Clear this bit to select the word transfer mode, and set it to select the byte transfer mode.
SDS: if the SDS field is set, SRCPx is combined with DS0/DS1 to calculate and generate the source address. If the SDS field is not set, the source address equals SRCPx.
DDS: if the DDS field is set, DSTPx is combined with DS0/DS1 to calculate and generate the destination address. If the DDS field is not set, the destination address equals DSTPx.
INC: Increment Control. To control the modification of SRCPx and DSTPx. Referring to
The COUNT field in DTUCx will be decremented by 1 after every DTU transfer. When the count field becomes 0, an interrupt will happen and enter interrupt service routine. The interrupt can process the transferred data, adjust the source/destination pointer and set the COUNT field again. After returning from the interrupt service routine, the DTU 51 will be activated again and continue to transfer data when the interrupt event occurs.
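The interaction between an interrupt request, a DTU transfer and the COUNT field can be modeled roughly as follows. This Python sketch is illustrative only: the class `DtuChannel`, the `bytearray` memory model, the two boolean increment flags standing in for the INC field, and the decision to run the service routine directly when COUNT is already zero are all assumptions.

```python
class DtuChannel:
    """Behavioral model of one DTU channel (SRCPn, DSTPn, COUNT, WBT)."""

    def __init__(self, srcp, dstp, count, byte_mode=False,
                 inc_src=True, inc_dst=True):
        self.srcp = srcp            # source address pointer
        self.dstp = dstp            # destination address pointer
        self.count = count          # remaining transfers (COUNT field)
        self.byte_mode = byte_mode  # WBT: True selects byte transfers
        self.inc_src = inc_src      # hypothetical stand-in for the INC field
        self.inc_dst = inc_dst

    def on_interrupt(self, memory):
        """Handle one interrupt request assigned to this channel.

        Returns True when the interrupt service routine should run.
        """
        if self.count == 0:
            return True             # assumed: nothing left to transfer
        size = 1 if self.byte_mode else 2
        memory[self.dstp:self.dstp + size] = memory[self.srcp:self.srcp + size]
        if self.inc_src:
            self.srcp += size
        if self.inc_dst:
            self.dstp += size
        self.count -= 1
        return self.count == 0      # COUNT reached 0: enter the routine


memory = bytearray(32)
memory[0:4] = b"\x01\x02\x03\x04"
channel = DtuChannel(srcp=0, dstp=16, count=2)
first = channel.on_interrupt(memory)    # transfers one word, no routine
second = channel.on_interrupt(memory)   # transfers one word, COUNT hits 0
```

In this model the service routine would reload `count` and the pointers before the next request, which corresponds to the re-arming step described above.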
Referring to
Seventh: External Memory Interface:
The external memory interface for the external ROM/RAM can support standard ROM, EEPROM, SRAM, NOR Flash Memory, and NAND Flash Memory. The connected external memory can be addressed to be larger than 16 MB. Because a dynamic external memory control register (EMCR) is used, the EMCR can control the start address of any memory. Currently three EMCRs are used.
Referring to
Referring to
The second segment address boundary is defined with (EMCR1.EMBLK)*0x10000 as the lower bound address and (EMCR2.EMBLK)*0x10000 as the upper bound address. The wait state of the second segment is EMCR1.WS. The third segment address boundary and the fourth segment address boundary are defined analogously.
Eighth: Clock Generation and Operation Modes
The present invention provides three operation modes: a normal operation mode, an idle mode and a sleep mode. The normal operation mode operates with a system clock. A CLKSEL field of a clock control register (CLKCON) determines the frequency of the system clock. The system clock is in an off state in the idle mode and the sleep mode.
Referring to
Ninth: Timers:
There are three Timers and one Time Base Clock (TB) in the present invention. Each Timer has one control register (TxC), one preload register (TxP) and one timer counter register (Tx).
The Time Base Clock is a simple timer that generates a 2 Hz˜32768 Hz interrupt signal. There is a time base control register (TBC) to control the frequency of the time base clock.
Tenth: I/O Ports and External Interrupts:
The present invention provides 64 I/O (input/output) pins, which are categorized into four groups: P0.0˜P0.15, P1.0˜P1.15, P2.0˜P2.15, and P3.0˜P3.15. Four of the I/O pins are used as external interrupt pins: P0.15, P0.14, P0.13 and P0.12 as INT0, INT1, INT2 and INT3.
As described above, the present invention provides a pipelined architecture capable of executing a single instruction in a single cycle. The detailed pipeline operation is described as follows.
Referring to
The first phase: fetching the instruction from a RAM or an instruction buffer.
The second phase: decoding the instruction and also simultaneously computing an operand location to fetch the operand from the memory if required. If the operand is stored in an address pointed by a register, which is an indirect addressing, the register is read to compute the location of the operand. In some addressing modes, the register is allowed to execute a post increment or pre decrement.
The third phase: the arithmetic unit 10 performing the calculation on the operand according to the instruction.
The fourth phase: a calculation result of the third phase is written to a target memory address.
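The overlap of the four phases can be visualized with a short Python sketch of an ideal pipeline. It assumes no stalls, branches or memory wait states, and the stage names and the function `pipeline_schedule` are hypothetical labels for the four phases above.

```python
STAGES = ["fetch", "decode", "execute", "writeback"]


def pipeline_schedule(instructions):
    """Return, for each cycle, a dict mapping stage name to instruction.

    Instruction i enters the fetch stage in cycle i and advances one stage
    per cycle, so one instruction completes per cycle once the pipe is full.
    """
    schedule = []
    total_cycles = len(instructions) + len(STAGES) - 1
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            index = cycle - depth
            if 0 <= index < len(instructions):
                row[stage] = instructions[index]
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(["I0", "I1", "I2"])
```

Three instructions finish in six cycles here; for a long instruction stream the cost approaches one cycle per instruction, which is the sense in which a single instruction executes in a single cycle.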
Branch Instruction Processing:
The processor of the present invention includes three types of transfer instructions. The first type of the transfer instructions is an unconditional transfer instruction such as SJMP and SCALL. The second type of the transfer instructions is a conditional transfer instruction such as SJMP, SCALL and JB. The third type of the transfer instructions is a special repeat operation to execute a loop in zero overhead.
Referring to
Referring to
The instruction fetch unit 60 immediately fetches the next instruction after fetching the conditional transfer instruction. Therefore, the next instruction will be decoded directly if there is no transfer action, as shown in
Referring to the appendix “Instruction set”, the instruction set of the processor of the present invention is shown. The instruction set can be divided into arithmetic, shift/rotate, bit operation, branch, comparison, data movement and MISC instructions. The processor of the present invention can achieve the best performance through the original and novel instruction set and the above-mentioned hardware architecture.
To conclude, the processor of the present invention includes features as follows:
First, a high-performance 4-stage pipelined architecture capable of executing an MCU or a DSP instruction in a single cycle.
Second, the single cycle MAC instruction execution with data movement ability can optimize the FIR algorithm.
Third, the bit reversal function is available to optimize the FFT algorithm.
Fourth, the repeat instruction can repeat a single instruction many times, so as to effectively simplify program writing.
Fifth, the automatic stack overflow/underflow detection can avoid complicated stack checks and support an unlimited stack structure.
Sixth, the present invention supports Word, Byte and Bit operations to better meet the requirements of MCU control.
Therefore, the integrated data processor of the present invention includes novelty and obviously improves the performance of the conventional data processor.
While the invention has been described by way of example and in terms of a preferred embodiment, it is to be understood that the invention is not limited thereto. On the contrary, it is intended to cover various modifications and similar arrangements and procedures, and the scope of the appended claims therefore should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements and procedures.