This application claims priority under 35 U.S.C. §119 from European Patent Application No. 09164014.4 filed Jun. 29, 2009, the entire contents of which are incorporated herein by reference.
The present invention relates to digital processors operable to execute program instructions for processing and/or streaming data. Moreover, the invention concerns methods of processing and/or streaming data in these digital processors.
Referring to
Owing to advances in silicon integrated circuit fabrication, for example those achieved using dry-etching fabrication techniques, ion implantation and short-wavelength optical lithographic processes, it has now become feasible to integrate multiple digital processors onto a single silicon integrated circuit by employing circuit feature dimensions of 100 nm or less, for example 65 nm. As a consequence of such miniaturization, transistor switching speeds have increased dramatically, whereas signal propagation delays occurring along interconnects employed within the integrated circuit have not reduced in proportion. Clock speeds of 1 GHz or more are now feasible in such integrated circuits. The change in interconnect material from aluminium to copper has provided some reduction in propagation delay, but does not fundamentally address the problem of interconnect propagation delays being significant in comparison to transistor switching speeds.
Integrated circuit designers have therefore evolved contemporary processor design as illustrated in
In a contemporary state-of-the-art general purpose microprocessor, the FUs 120 are fabricated so that, within a given cluster 110, they are completely separate from their associated register files containing the register contents. A disadvantage of such a configuration is that more than one hardware cycle is required to transfer data between an FU 120 and its associated register file, for example by way of a pipeline architecture executing multiple steps r1, r2 and r3: step r1 involves transferring data from the register to the FU 120, step r2 involves executing a function on the data at the FU 120 to generate processed data, and step r3 involves moving the processed data from the FU 120 to the register. In a published research paper “AMD's Mustang versus Intel's Willamette”, there is described in overview an alleged single cycle arithmetic logic unit (ALU), for example an FU 120 which is embedded between two staging registers. However, such a configuration still requires data to be transferred from the register file to an input register of the ALU and is therefore, in practice, not genuinely a single cycle arithmetic logic unit (ALU). The need to perform several cycles presently represents a limitation on the processing speed achievable using a contemporary state-of-the-art microprocessor.
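Purely by way of a non-limiting illustration, the multi-cycle interaction described above can be sketched as a simple cycle-counting model in Python; the register names, the addition operation and the per-step cycle cost are assumptions made for illustration only and are not taken from the cited paper or from any particular microprocessor.

# Minimal sketch of the separate register-file/FU interaction described above
# (steps r1, r2, r3). All names and costs are illustrative assumptions.

class SeparateRegisterFileModel:
    """Functional unit and register file are separate; each transfer costs a cycle."""

    def __init__(self):
        self.registers = {"R1": 0, "R2": 0, "R3": 0}
        self.cycles = 0

    def add(self, dst, src_a, src_b):
        # r1: move the operands from the register file to the FU inputs
        a, b = self.registers[src_a], self.registers[src_b]
        self.cycles += 1
        # r2: the FU executes the operation on its local copies of the operands
        result = a + b
        self.cycles += 1
        # r3: move the processed data from the FU back into the register file
        self.registers[dst] = result
        self.cycles += 1
        return result


model = SeparateRegisterFileModel()
model.registers.update({"R2": 5, "R3": 7})
model.add("R1", "R2", "R3")
print(model.registers["R1"], model.cycles)   # 12 after three cycles for one addition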
It is an object of the invention to increase processing speeds of microprocessors by reducing the number of cycles required for transferring data within the microprocessors.
This object is achieved by the features of the independent claims. The other claims and the specification disclose advantageous embodiments of the invention.
According to a first aspect of the present invention, there is provided a processor subunit for a processor for processing data, wherein the subunit includes:
The invention is of advantage in that the processor subunit is capable of functioning at an enhanced rate for processing data.
Particularly, the functional units themselves are free of internal registers. The multiplexors advantageously allow for addressing the desired register so that there is no need for separate read and write ports at each register.
Optionally, the input multiplexors form a cross bar switch, wherein the multiplexors are connected by wires.
Optionally, the output of each functional unit is connected to one or more other registers, preferably to all other registers.
Optionally, registers connected to the input of the at least one functional unit are writable from an output of at least one other functional unit.
Optionally, registers connected to the input of the at least one functional unit are writable from at least one other register.
According to a second aspect of the present invention, there is provided a processor for processing data, said processor including at least one functional unit (FU) for executing instructions on data, comprising a processor subunit according to any of the features described above.
Optionally, there is provided a processor for processing data, said processor including at least one functional unit (FU) for executing instructions on data, wherein the at least one functional unit (FU) has at least one register associated therewith, said register being operable to hold one or more addresses of one or more registers associated with the at least one functional unit (FU), said one or more registers being addressed by the instructions for providing a direct any-to-any connection between the one or more registers associated with the at least one functional unit (FU), thereby providing a single cycle data path between the at least one functional unit (FU) and its associated one or more registers.
The invention is of advantage in that the processor is capable of functioning at an enhanced rate for processing data.
Optionally, the processor includes a plurality of functional units (FU), each functional unit (FU) being provided with one or more associated registers, the processor further comprising one or more buses from at least a sub-set of the functional units (FU) to any of the registers.
Optionally, in the processor, one or more registers operable to store operands serve their associated functional units (FU) directly for reducing bypass overheads.
Optionally, the processor is fabricated into an integrated circuit concurrently with a cache memory, streaming logic and a controller coupled to the processor, wherein the integrated circuit is operable to function as a programmable streaming accelerator. More optionally, the controller is coupled to a same nest-frequency clock as the processor.
Optionally, in the processor, the controller is a BaRT-controller which is operable to reconfigure said streaming accelerator in response to receiving reconfiguring instructions.
More optionally, in the processor, the controller is operable to employ three states of “0”, “1” and “don't care” for enabling state transitions within the streaming accelerator to be achieved without branches.
According to a third aspect of the invention, there is provided a programmable streaming accelerator comprising a processor pursuant to the second aspect of the invention fabricated into an integrated circuit concurrently with a cache memory, streaming logic and a controller coupled to the processor, wherein the integrated circuit is operable to function as the programmable streaming accelerator.
According to a fourth aspect of the invention, there is provided a method of operating a programmable streaming accelerator, pursuant to the third aspect of the invention, the method including:
It will be appreciated that features of the invention are susceptible to being combined in any combination without departing from the scope of the invention.
The present invention, together with the above-mentioned and other objects and advantages, may best be understood from the following detailed description of the embodiments, to which it is however not restricted, wherein:
In the drawings, like elements are referred to with equal reference numerals. The drawings are merely schematic representations, not intended to portray specific parameters of the invention. Moreover, the drawings are intended to depict only typical embodiments of the invention and therefore should not be considered as limiting the scope of the invention.
In numerous contemporary electronic systems, there is a need to provide complex streams of processed data, for example in multimedia systems, Internet-coupled apparatus and so forth. The need is addressed by various types of data server architecture which have over time evolved from single-processor servers to multiprocessor servers supporting various software applications for providing database, application and web services in heterogeneous customer environments. Such electronic systems are operable to process various data streams at high speed for achieving efficient data communication. One bottleneck encountered in contemporary state-of-the-art server platforms is limited communication bandwidth for processing data streams between various client software applications.
A contemporary solution to such limited communication bandwidth is to employ dedicated input/output hardware in servers, for example acceleration hardware. Alternatively, another solution is to employ an external interface to the processor units of multiprocessor systems. Examples of these contemporary solutions are to be found in IBM's proprietary p- and z-Series server platforms utilizing a proprietary Infiniband architecture for high-bandwidth communication. This contemporary architecture exhibits microsecond latencies when handling data streams, which is presently acceptable. However, it is desirable that future server platforms achieve sub-microsecond latencies. Existing processor designs unfortunately do not allow sub-microsecond latencies to be achieved.
Contemporary open system architectures utilize various protocols for providing high-speed data stream processing. Such protocols include the well-known Infiniband, TCP/IP and Hypertransport protocols. These architectures are also required to perform other functions such as parsing of XML documents, data compression and data encryption. Providing high-bandwidth data processing with low latencies requires the aforementioned acceleration hardware, because generic contemporary processor architectures are not optimized for executing high-speed data stream processing and other related functions. Thus, contemporary solutions for providing high-speed data stream processing involve the use of hardware implementations which are individually designed and adapted for dedicated applications, for example data compression and/or data decoding.
The present invention seeks to increase processing speeds of microprocessors by reducing the number of cycles required for transferring data within the microprocessors. This reduction in cycles enables a universal programmable streaming accelerator (UPSA) to be realized which provides high-speed data processing with low latency. In
The BaRT-controller 220 is described in a published US patent application no. US 2005/0132342 which is hereby incorporated by reference. In the U.S. patent application, there is described an XML parsing system including a pattern-matching system for receiving an input stream of characters corresponding to the XML document to be parsed. The pattern matching system includes two main components: a controller operable to function as a programmable state machine programmed with an appropriate state transition diagram, and a character processing unit operable to function as token and character handler. The programmable state machine is also operable to search for a highest-priority state transition rule using a variation of a BaRT algorithm as described in J. van Lunteren, “Searching very large routing tables in wide embedded memory,” Proceedings of the IEEE Global Telecommunications Conference GLOBECOM'01, vol. 3, pp. 1615-1619, San Antonio, Tex., November 2001.
The UPSA 200 is operable to process various data streams whilst providing high bandwidth and low latency. Moreover, the UPSA 200 is beneficially fabricated onto a single silicon die. The USL 210 and the BaRT-controller 220 constitute a hardware accelerator for speeding up streaming applications, for example network protocol processing, XML-parsing and compression. The UPSA 200 is beneficially coupled to a high nest frequency of the processor core 230. For processing incoming and outgoing data streams in parallel, a plurality of UPSAs 200 can be employed. The hardware accelerator is capable of providing benefits of increased data processing speed, universality and flexibility in respect of different streaming tasks on account of the re-programmability of the BaRT-controller 220. The BaRT-controller 220 is configured by loading a program into the controller's memory; such loading of the program can be undertaken at any time without rebooting the UPSA 200. For example, it is possible to execute network protocol processing first and then subsequently switch the UPSA 200 to execute XML-parsing.
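As a non-limiting sketch of this run-time reconfiguration, the following Python model shows a controller whose transition-rule program can be swapped without rebooting the surrounding model; the rule format and the two example programs are hypothetical and are not taken from the source.

# Illustrative sketch only: a controller model whose rule program is replaced
# at run time. The rule tuples below are invented placeholders.

NETWORK_PROTOCOL_RULES = [("parse_header", 0, 1), ("parse_payload", 1, 0)]   # hypothetical
XML_PARSING_RULES = [("match_open_tag", 0, 1), ("match_close_tag", 1, 0)]    # hypothetical


class ControllerModel:
    def __init__(self):
        self.rules = []          # currently loaded transition-rule program
        self.state = 0

    def load_program(self, rules):
        """Swap in a new rule program; the accelerator model keeps running."""
        self.rules = list(rules)
        self.state = 0           # restart the state machine for the new task


controller = ControllerModel()
controller.load_program(NETWORK_PROTOCOL_RULES)   # execute network protocol processing first
controller.load_program(XML_PARSING_RULES)        # later switch to XML-parsing, no reboot
print(len(controller.rules))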
The USL 210 is operable to process incoming data by employing a method comprising:
Loading and storing data is performed by dedicated logic operable to handle data transfers between:
When data streams of specific applications are to be transferred merely between the cache memory 240 and the USL 210, the streaming buffer 250 can be used as additional memory, for example in the manner of a stack.
Access to a main memory coupled to the UPSA 200 is performed by a cache controller 300 of the processor core 230. Thus, from a viewpoint of the UPSA 200, memory access is processed in a similar manner to cache memory access, namely in a transparent manner. Such a manner of data access enables the UPSA 200 to access data in its cache memory 240 in a very efficient manner. Moreover, cache coherency between different processors 230 is beneficially provided by the microprocessor's cache controller 300. Moreover, the USL 210 is beneficially provided with an interface 310 to an address translation unit (ATU) 320 of the processor core 230 for translating virtual addresses into corresponding physical addresses.
The UPSA 200 beneficially also includes a parallel arithmetic logic unit (ALU) 260 comprising one or more general purpose registers (GPR) 330 together with multiple fully-independent arithmetic units for executing operations, for example additions, subtractions, shift operations and comparisons of both 32-bit and 64-bit wide values, for modifying and comparing data. In operation, data from the cache memory 240 or from the streaming interface can be loaded into the one or more general purpose registers (GPR) 330 and vice versa. Data stored in the general purpose registers (GPR) 330 can optionally be employed as operands in arithmetic operations or as addresses for either streaming operations or access to the cache memory 240.
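By way of illustration only, the behaviour of such a parallel ALU can be sketched as follows in Python; the register names, the fixed set of operations and the single-cycle issue model are assumptions for the sketch and do not describe an actual implementation.

# Sketch of the parallel ALU idea: several independent arithmetic units each
# apply one operation per modelled cycle to operands from general purpose
# registers, masked to 32-bit or 64-bit width. Names are illustrative.

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "shl": lambda a, b: a << b,
    "cmp": lambda a, b: int(a == b),
}


def parallel_alu_cycle(gprs, issued, width=64):
    """Execute all issued operations 'in parallel' within one modelled cycle."""
    mask = MASK32 if width == 32 else MASK64
    results = {}
    for dst, op, src_a, src_b in issued:
        results[dst] = OPS[op](gprs[src_a], gprs[src_b]) & mask
    gprs.update(results)      # write back all results together
    return gprs


gprs = {"G0": 10, "G1": 3, "G2": 0, "G3": 0, "G4": 0}
parallel_alu_cycle(gprs, [("G2", "add", "G0", "G1"),
                          ("G3", "shl", "G0", "G1"),
                          ("G4", "cmp", "G0", "G1")])
print(gprs)   # G2=13, G3=80, G4=0 after one modelled cycle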
In
Referring next to the BaRT-controller 220, the BaRT-controller 220 is based upon a programmable finite state machine (P-FSM). The BaRT-controller 220 provides for multiple branches and thereby enables increased speed to be achieved in comparison to processors which can branch only once per cycle. The multiple branches accommodated by the BaRT-controller 220 are limited by the size of the P-FSM's transition rule memory. Moreover, the BaRT-controller 220 employs a hash-algorithm which encodes and distributes the transition rules in a selective and targeted manner, thereby saving address space. The BaRT-controller 220 only allocates as much memory as there are transition rules, in contradistinction to standard P-FSMs, which allocate memory for all possible input and output vector combinations. As a result of this lower memory consumption, the BaRT-controller 220 in the context of the present invention enables a combination of universal logic and a programmable finite state machine in a practically feasible manner.
The BaRT-controller 220 is as described in the aforementioned published US patent application no. US 2005/0132342, which is hereby incorporated by reference. The BaRT-controller 220 employs ternary input vectors, namely “0”, “1” and “don't care”, such that state transitions without branches can be made easily by merely applying a “don't care” input. The BaRT-controller 220 is coupled to the GHz clock of the processor core 230 for ensuring the greatest operating speed. Fast direct access is provided to both the cache memory 240 and the streaming interface for reducing latencies such as those occurring on peripheral interconnects.
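Purely as an illustrative sketch, the ternary transition-rule matching described above can be modelled in Python as follows; the rule encoding is a simplification assumed for illustration, and the hash-based indexing of the actual BaRT scheme is deliberately omitted.

# Simplified sketch of ternary transition-rule matching: each rule tests the
# current state and an input vector whose positions may be '0', '1' or 'x'
# ("don't care"). The first matching rule in the list has highest priority.

def matches(pattern, input_bits):
    """True if every non-'x' position of the pattern equals the input bit."""
    return all(p in ("x", b) for p, b in zip(pattern, input_bits))


def next_state(state, input_bits, rules):
    """Return the next state from the highest-priority (first) matching rule."""
    for rule_state, pattern, target in rules:
        if rule_state == state and matches(pattern, input_bits):
            return target
    raise ValueError("no transition rule matches")


rules = [
    (0, "1x", 2),   # branch taken when the first input bit is 1
    (0, "xx", 1),   # all "don't care": taken regardless of input, i.e. no branch
    (1, "x1", 0),
]

print(next_state(0, "10", rules))   # 2  (the more specific rule wins)
print(next_state(0, "00", rules))   # 1  (the "don't care" rule gives an unconditional step)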
By reprogramming the BaRT-controller 220, the UPSA 200 can be adapted to various different data streaming applications; such reprogramming is beneficially achieved without a need to reboot.
As aforementioned, the UPSA 200 provides for processing of various data streams whilst providing high bandwidth and low latency. For example, the UPSA 200 is capable of being used to provide a universal method of efficiently processing data streams. The UPSA 200 provides efficient parallel processing support for data streams by using the aforesaid BaRT concept to control arithmetic and logic units of the USL 210. Very long instruction words (VLIW) are employed to directly control different functions of the USL 210 in parallel. The evaluation of conditions to branch into different process steps is also executed in parallel in the UPSA 200.
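As a non-limiting sketch of this VLIW-style parallel control, the following Python model dispatches every field of one wide instruction word to its functional unit within the same modelled cycle; the field names and the set of units are assumptions made for illustration only.

# Illustrative sketch of VLIW-style control: one instruction word carries one
# field per functional unit, and all fields take effect in the same cycle.

from typing import Callable, Dict


def execute_vliw(word: Dict[str, tuple], units: Dict[str, Callable]) -> dict:
    """Dispatch every field of the instruction word to its unit 'in parallel'."""
    return {name: units[name](*operands) for name, operands in word.items()}


units = {
    "load":    lambda addr: f"load@{addr:#x}",
    "alu":     lambda a, b: a + b,
    "compare": lambda a, b: a < b,        # branch condition evaluated alongside the rest
    "store":   lambda addr, v: (addr, v),
}

word = {"load": (0x40,), "alu": (3, 4), "compare": (3, 4), "store": (0x44, 9)}
print(execute_vliw(word, units))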
The UPSA 200 employs a configuration which enables close proximity to the hierarchy of the cache memory 240, so that latencies for processing of streamed data are reduced in comparison to contemporary state-of-the-art data processing systems. Moreover, in the UPSA 200, the BaRT-controller 220 employs ternary input vectors, namely “0”, “1” and “don't care” states, which allows for more efficient program code to be employed when the UPSA 200 is in operation. Optionally, an array of BaRT-controllers 220 can be employed to function in parallel, namely to synchronize and mutually communicate via a set of registers or data memory. Each of the BaRT-controllers 220 can be used to control a different selection of functional units or functions within the UPSA 200.
The UPSA 200 is thus beneficially implemented to include a processor for processing data, the processor including at least one functional unit (FU) for executing instructions on data, wherein the at least one functional unit (FU) has at least one register associated therewith, the register being operable to hold one or more addresses of one or more registers associated with the at least one functional unit (FU), the one or more registers being addressed by the instructions for providing a direct any-to-any connection between the one or more registers associated with the at least one functional unit (FU), thereby providing a single cycle data path between the at least one functional unit (FU) and its associated one or more registers. Such a configuration provides the UPSA 200 with increased speed for streaming and/or processing data.
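As a behavioural sketch only, the single cycle data path just described can be modelled as follows in Python; the register count, the lambda operation and the one-cycle cost are assumptions made for illustration and do not define an implementation.

# Minimal behavioural sketch: input multiplexors select operand registers by
# address, the functional unit computes, and the result is written to the
# addressed destination register, all within one modelled cycle.

class SingleCycleFU:
    def __init__(self, num_registers=8):
        self.registers = [0] * num_registers
        self.cycles = 0

    def execute(self, op, dst_addr, src_a_addr, src_b_addr):
        # Multiplexors select the operand registers directly (no staging step),
        # the FU computes, and the result is routed to any destination register.
        a = self.registers[src_a_addr]
        b = self.registers[src_b_addr]
        self.registers[dst_addr] = op(a, b)
        self.cycles += 1        # the whole read-execute-write path in one cycle


fu = SingleCycleFU()
fu.registers[2], fu.registers[3] = 5, 7
fu.execute(lambda a, b: a + b, dst_addr=1, src_a_addr=2, src_b_addr=3)
print(fu.registers[1], fu.cycles)   # 12 after a single modelled cycle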
Conventional microprocessors employ functional units (FUs) which are separated from register files containing the register contents. For achieving a short cycle time, instruction processing is conventionally split into several cycles. For a register-to-register instruction, for example adding the content of register R2 to the content of register R3 and recording the corresponding addition result in register R1, one cycle is used to read the contents of registers R2 and R3 from the register file, a second cycle performs the actual addition operation, and a third cycle is used to store the result in register R1 in the register file. In transport triggered processor architectures (TTA), registers are attached to a functional unit (FU) and are not separate in the conventional manner. In the extreme case, an instruction set for a TTA involves only one instruction: “move”. Multiple FUs are beneficially connected via crossbar switches in a TTA.
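By way of a non-limiting sketch of the transport-triggered idea mentioned above, the following Python model uses only “move” operations: moving a value to a functional unit's trigger port starts the operation. The port names and the adder unit are assumptions made for illustration only.

# Rough sketch of a transport-triggered adder: the program consists solely of
# moves, and writing the trigger port starts the computation.

class AdderFU:
    """A functional unit with an operand port and a trigger port."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def move_to_operand(self, value):
        self.operand = value

    def move_to_trigger(self, value):
        # Writing the trigger port starts the operation.
        self.result = self.operand + value


adder = AdderFU()
registers = {"R1": 0, "R2": 5, "R3": 7}

# Program consisting only of moves: R2 -> operand, R3 -> trigger, result -> R1
adder.move_to_operand(registers["R2"])
adder.move_to_trigger(registers["R3"])
registers["R1"] = adder.result
print(registers["R1"])   # 12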
For achieving an optimized processor design, for example for use in the aforementioned UPSA 200, single cycle functional units (FU) are beneficially employed.
In
In
Referring again to the UPSA 200, an example operation is presented in
In a first step illustrated in
In a second step illustrated in
In a third step illustrated in
In a fourth step illustrated in
In conclusion, there is described in the foregoing a processor for processing data, the processor including at least one functional unit (FU) for executing instructions on data, wherein the at least one functional unit (FU) has at least one register associated therewith, the register being operable to hold one or more addresses of one or more registers associated with the at least one functional unit (FU), the one or more registers being addressed by the instructions for providing a direct any-to-any connection between the one or more registers associated with the at least one functional unit (FU), thereby providing a single cycle data path between the at least one functional unit (FU) and its associated one or more registers. Such a processor is susceptible to being utilized in various data processing systems and apparatus, for example in the aforementioned UPSA 200. The UPSA 200 beneficially includes at least one BaRT-controller for configuring the UPSA 200, for example with regard to data pathways therein and also the functions to be performed by the functional units (FU) of its processor core 230. Moreover, the UPSA 200 is susceptible to being used in consumer electronic devices such as multimedia apparatus, video systems, Internet-coupled devices, wireless communication devices, personal computers and mobile telephones (cell phones), as well as in infrastructure devices such as servers, wireless telephone infrastructure, satellites, network servers and transport systems, to mention a few diverse examples.
Modifications to embodiments of the invention described in the foregoing are possible without departing from the scope of the invention as defined by the accompanying claims.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O-devices (including, but not limited to, keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or to remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Expressions such as “including”, “comprising”, “incorporating”, “consisting of”, “have”, “is” used to describe and claim the present invention are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural.