The present invention relates to the field of data processing. More specifically, the present invention relates to data processing using a set of processing elements with a global file register and global predicates.
Computing workloads in the emerging world of “high definition” digital multimedia (e.g. HDTV and HD-DVD) more closely resemble workloads associated with scientific computing, or so-called supercomputing, than general purpose personal computing workloads. Unlike traditional supercomputing applications, which are free to trade performance for super-size or super-cost structures, entertainment supercomputing in the rapidly growing digital consumer electronics industry imposes extreme constraints of both size and cost.
With rapid growth has come rapid change in market requirements and industry standards. The traditional approach of implementing highly specialized application-specific integrated circuits (ASICs) is no longer cost effective, as the research and development required for each new ASIC is less likely to be amortized over the ever-shortening product life cycle. At the same time, ASIC designers are able to optimize efficiency and cost through judicious use of parallel processing and parallel data paths. An ASIC designer is free to look for explicit and latent parallelism in every nook and cranny of a specific application or algorithm, and then exploit it in circuits. With the growing need for flexibility, however, an embedded parallel computer is needed that finds the optimum balance between all of the available forms of parallelism, yet remains programmable.
Embedded computation requires more generality/flexibility than that offered by an ASIC, but less generality than that offered by a general purpose processor. Therefore, the instruction set architecture of an embedded computer can be optimized for an application domain, yet remain “general purpose” within that domain.
The present invention is a stream processing accelerator which includes multiple coupled processing elements which are interconnected through a shared file register and a set of global predicates. The stream processing accelerator has two modes: full-processor mode and circuit mode. In full-processor mode, a branch unit, an arithmetic logic unit and a memory unit work together as a regular processor. In circuit mode, each component acts as a functional unit with configurable interconnections.
A stream processing accelerator includes n processing elements (PEs), m registers organized as a global file register (GFR) used to exchange data between PEs and p global predicates used by the PEs as condition bits. Of the global predicates, one is selected by each PE and is available to the other PEs, while the rest of the global predicates are set by explicit instructions by any PE.
With multiple PEs communicating with the multiple registers within the GFR, it is possible to execute various instructions on data, thus providing a more efficient processing unit. Any PE can read/write to any of the registers within the GFR, providing flexibility as well.
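The shared-register arrangement above can be sketched in software as follows. This is a minimal illustrative model, not the claimed hardware; the class and method names are hypothetical, and only the stated parameters (any PE may read or write any register, 16-bit register width in the exemplary embodiment) come from the description.

```python
# Minimal sketch of a global file register (GFR) shared by the PEs.
# Names (GlobalFileRegister, read, write) are illustrative only.

class GlobalFileRegister:
    def __init__(self, num_registers=8, width_bits=16):
        self.mask = (1 << width_bits) - 1
        self.regs = [0] * num_registers

    def read(self, index):
        # Any PE is able to read any register within the GFR.
        return self.regs[index]

    def write(self, index, value):
        # Any PE is able to write any register; values wrap to the register width.
        self.regs[index] = value & self.mask

gfr = GlobalFileRegister()
gfr.write(0, 0x1234)
gfr.write(1, 0x1FFFF)   # wraps to 16 bits -> 0xFFFF
```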
Each PE is a two-stage pipeline machine: fetch and decode; execute and write back. Each PE contains a local file register, an Arithmetic Logic Unit (ALU), a Branch Unit (BU), a Memory access Unit (MU), a program memory and a data memory.
Each PE can be configured to function in two different modes: full-processor mode or circuit mode. The method of changing modes preferably includes toggling a register bit. The mode is able to be pre-configured or configured later. Furthermore, since each PE is able to be configured independently, it is possible to have some PEs in full-processor mode and some in circuit mode.
In full-processor mode, the BU, the ALU and the MU work together as a regular processor. Furthermore, the PEs are able to work as a pipeline where some or all of the PEs are interconnected so that each PE uses data generated by the previous PE.
In circuit mode, each component acts as a functional unit with configurable interconnections. ALUs are used to implement the logic, MUs implement look-up tables, BUs implement state machines, operand registers store the state of the circuit, instruction registers are configuration registers for the BU, the ALU and the MU, and special function registers provide an I/O connection.
By reading and writing in a specific order, the stream processing accelerator 100 can act like a pipeline. For example, the stream processing accelerator 100 can be configured such that PE0 writes to a register, R0, and PE1 reads from R0; PE1 then writes to a register, R1, and PE2 reads from R1, and so on. The last register, Rn, wraps around: the last PE writes to Rn, and the first PE, PE0, reads from Rn. Thus, even sequential data is able to be processed efficiently via a pipeline.
The global predicates 106 used within the stream processing accelerator are preferably 1-bit flip-flops. Preferably, there are more global predicates 106 than PEs 104. For example, the stream processing accelerator 100 with 8 PEs 104 and 8 registers in the GFR 102 could have 32 global predicates 106. The first n global predicates are individually associated with each PE, where n is the number of PEs, such as 8. The other global predicates are set and/or tested by any PE in order to decide what action to take. For example, if a program has a branch and needs to compute the value of c[0] to determine which branch to take, a global predicate is able to be set to the value of c[0], and then the PEs that need to know that value are able to execute based on the value read in the global predicate. This provides a way to implement the efficient processing system as described in U.S. patent application Ser. No. ______, entitled “INTEGRAL PARALLEL MACHINE”, [Attorney Docket No. CONX-00101] filed ______, which is hereby incorporated by reference in its entirety.
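The c[0] branching example above can be illustrated as follows. This is a software analogy only: the predicate index and the helper functions are hypothetical, and the real predicates are hardware flip-flops set and tested by explicit instructions.

```python
# Illustrative use of a global predicate as a shared condition bit:
# one PE computes c[0] and publishes it; other PEs branch on it.
predicates = [0] * 32            # 1-bit flip-flops, modeled as ints

def set_predicate(i, value):
    predicates[i] = 1 if value else 0

def branch_on(i, taken_path, fallthrough_path):
    # A consumer PE tests the predicate to choose its action.
    return taken_path() if predicates[i] else fallthrough_path()

c = [1, 0, 7]
set_predicate(8, c[0])           # index 8 is a hypothetical non-per-PE predicate
result = branch_on(8, lambda: "branch taken", lambda: "fall through")
```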
An additional mode of the PEs is tree mode, which is accessible in full-processor mode. Utilizing the present invention, a PE is able to efficiently traverse a very unbalanced tree. Tree mode is dedicated to Variable Length Decoding (VLD), and an example of VLD is Huffman coding. Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. In tree mode, the PE uses a different set of instructions optimized for fast bit processing. The PE will continuously read bits from a bit queue and advance in the VLD state tree until a terminal state is entered (meaning that a complete symbol was decoded). From a terminal state, the PE re-enters the full-processor mode, leaving a result value in a register.
The following is an exemplary VLC table:
During tree mode, a 32-bit instruction is divided into 4 sub-instructions, each having 1 byte. Based on the value of the top bits of a bit queue, one of the 4 sub-instructions will be executed. The number of bits read from the bit queue and the function used to select the sub-instruction are specified by a state register only used in the tree mode.
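The sub-instruction selection above can be sketched as follows. The byte extraction is straightforward; using the top two queue bits directly as the selector index is an assumption, since the actual selection function and bit count are programmed through the tree-mode state register.

```python
# Hypothetical decoding of a 32-bit tree-mode instruction into four 1-byte
# sub-instructions, one of which is selected by bits from the bit queue.
def select_sub_instruction(instr32, queue_bits):
    """queue_bits: top bits of the bit queue, assumed here to be a 2-bit
    index (the real selection function comes from the state register)."""
    subs = [(instr32 >> (8 * i)) & 0xFF for i in (3, 2, 1, 0)]  # high byte first
    return subs[queue_bits]

instr = 0xAABBCCDD                       # illustrative instruction word
select_sub_instruction(instr, 0b10)      # selects the third sub-instruction
```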
A bit is used to test and find an end result or state. The state result may be found in 1 clock cycle, as in the left branch of the tree shown in the figure.
In an exemplary embodiment, the GFR includes 8 16-bit registers shared by all 8 PEs. If one or more PEs are in circuit mode, then each individual ALU or MU can access the GFR. A write to the GFR requires passing data through an additional pipeline register, so writes to the GFR are performed 1 clock cycle later than local file register writes. Local file register writes are performed in the execute stage, while GFR writes are performed in the write-back stage.
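The one-cycle write delay described above can be modeled with an explicit pipeline register. This is an illustrative software model; the class and method names are hypothetical, and only the stated behavior (GFR writes commit in write-back, one clock after local writes) comes from the description.

```python
# Sketch of the extra pipeline register on GFR writes: a staged write
# becomes visible one clock later, in the write-back stage.
class DelayedGFR:
    def __init__(self, n=8):
        self.regs = [0] * n
        self.pending = []                # (index, value) held in the pipeline reg

    def write(self, index, value):
        self.pending.append((index, value))   # staged, not yet visible

    def clock(self):
        # Write-back stage: staged writes commit one cycle later.
        for index, value in self.pending:
            self.regs[index] = value & 0xFFFF
        self.pending = []

gfr = DelayedGFR()
gfr.write(3, 42)
before = gfr.regs[3]     # still 0: the write has not yet committed
gfr.clock()
after = gfr.regs[3]      # 42 after the write-back cycle
```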
Although any individual PE (or any ALU/MU in circuit mode) can access the global file register, there are some restrictions on the number of simultaneous accesses permitted. From each PE in circuit mode, only one of the two units (ALU and MU) is allowed to write in the global file register at any given time. In case of a conflict, only the MU will write. The restriction does not apply to the full-processor mode because full-processor mode instructions only have one result. For each PE in circuit mode, an ALU left operand register and an MU address register cannot both be global registers. For each PE in circuit mode, an ALU right operand register and an MU data register (for STORE operations) cannot both be global registers.
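The conflict rule above (only the MU writes when both units target the GFR in the same cycle) can be sketched as a small arbiter. The function name and tuple representation are hypothetical.

```python
# Sketch of the circuit-mode write-conflict rule: when a PE's ALU and MU
# both request a GFR write in the same cycle, only the MU's write commits.
def arbitrate_gfr_write(alu_req, mu_req):
    """Each request is None or an (index, value) tuple; returns the winner."""
    if mu_req is not None:     # MU has priority on conflict
        return mu_req
    return alu_req

arbitrate_gfr_write((0, 11), (0, 22))   # conflict: the MU's write wins
```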
As described above, the global predicates are used by branch units executing branch instructions. A branch instruction can test up to 2 predicates at a time in order to decide if the branch is taken. The predicates include 6 flags from each PE and 16 global flags. The global flags can be modified by any PE using set and clear instructions.
To utilize the present invention, a set of PEs is coupled to a GFR and global predicates for processing data efficiently. The present invention is able to implement PEs in two separate modes, full-processor mode and circuit mode. In addition to setting a mode, the configuration of PEs is also modifiable. For example, a first subset of PEs is set to circuit mode and a second subset of PEs is set to full-processor mode. Additionally, subsets can be set to full-processor mode or circuit mode with equal or different numbers of PEs in each subset. After the mode and configuration are selected, or pre-selected, the present invention processes data accordingly by reading and writing to the GFR.
In operation, the present invention processes data using the PEs, GFR and global predicates. The PEs read from and write to the GFR in a manner that efficiently processes the data. Furthermore, the global predicates are utilized when branch instructions are encountered wherein a PE determines the next step based on the value in the global predicate.
There are many uses for the present invention, in particular where large amounts of data are processed. The present invention is very efficient when processing long streams of data such as in graphics and video processing, for example HDTV and HD-DVD.
The present invention has been described in terms of specific embodiments incorporating details to facilitate the understanding of principles of construction and operation of the invention. Such reference herein to specific embodiments and details thereof is not intended to limit the scope of the claims appended hereto. It will be readily apparent to one skilled in the art that other various modifications may be made in the embodiment chosen for illustration without departing from the spirit and scope of the invention as defined by the claims.
This patent application claims priority under 35 U.S.C. §119(e) of the co-pending, co-owned U.S. Provisional Patent Application No. 60/841,888, filed Sep. 1, 2006, and entitled “INTEGRAL PARALLEL COMPUTATION”, which is also hereby incorporated by reference in its entirety. This patent application is related to U.S. patent application Ser. No. ______, entitled “INTEGRAL PARALLEL MACHINE”, [Attorney Docket No. CONX-00101], filed ______, which is also hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
60841888 | Sep 2006 | US