1. Technological Field
The present disclosure relates to a microcomputer with reduced power consumption and enhanced performance, and to methods of designing and operating the same.
2. Description of the Related Technology
Nowadays, a typical embedded system requires high performance to perform tasks such as video encoding/decoding at run-time. It should consume little energy so as to be able to run for hours or even days on a lightweight battery. It should be flexible enough to integrate multiple applications and standards in one single device. It has to be designed and verified in a short time to market despite substantially increased complexity. Designers are struggling to meet these challenges, which call for innovations in both architecture and design methodology.
Coarse-grained reconfigurable arrays (CGRAs) are emerging as potential candidates to meet the above challenges. Many designs have been proposed in recent years. These architectures often comprise tens to hundreds of functional units (FUs), which are capable of executing word-level operations instead of the bit-level ones found in common field-programmable gate arrays (FPGAs). This coarse granularity greatly reduces the delay, area, power and configuration time compared with FPGAs. On the other hand, compared with traditional “coarse-grained” programmable processors, their massive computational resources enable them to achieve high parallelism and efficiency. However, existing CGRAs have not yet been widely adopted, mainly because of the difficulty of programming such a complex architecture.
To address this problem, B. Mei et al., in “ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix,” International Conference on Field-Programmable Logic and Applications, have proposed a microcomputer with a tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix, called the ADRES architecture (Architecture for Dynamically Reconfigurable Embedded Systems).
It is an object of embodiments of the present disclosure to provide an improved microcomputer, as well as methods of operating the same. An advantage of embodiments of the present disclosure is reduced power consumption.
The above objective is accomplished by a method and device according to the present disclosure.
In a first aspect, the present disclosure provides a microcomputer for executing an application. The microcomputer comprises a heterogeneous coarse-grained reconfigurable array comprising a plurality of functional units, optionally register files, and memories, and at least one processing unit supporting multiple threads of control. The at least one processing unit may be a VLIW processor. At least one processing unit is adapted for allowing each thread of control to claim one or more of the functional units to work for that thread. It is a particular feature of embodiments of the present disclosure that at least one processing unit is adapted for allowing the threads of control to reconfigure at run-time the claiming of particular types of functional units to work for that thread, depending on requirements of the application, e.g. workload, and/or the environment, e.g. current usage of FUs. The reconfiguration enables run-time selection of a different pre-compiled version of a same application, different versions of the same application making use of at least one other type of functional unit. This means that resources for a configured stream can be reconfigured at run-time, depending on the requirements of the application and/or the current workload. This way, the present disclosure provides multithreading with dynamic allocation of CGA resources. Based on the demand of the application and the current utilization of the CGRA, different resource combinations can be claimed.
The claiming of particular types of functional units is heterogeneous resource claiming, where heterogeneous functional units may, for example, have different instruction sets. As an example only, threads requiring mostly scalar operations and low memory bandwidth may claim other types of functional units than threads which are highly vector intensive and/or highly memory-bandwidth intensive in their requirements.
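By way of illustration only, the heterogeneous claiming described above can be sketched as a simple run-time policy: a thread's profile (scalar-dominated versus vector-dominated) maps to a request for particular FU types, served from a pool of free FUs. All names here (`claim_for_thread`, the FU type labels, the requested counts) are hypothetical and not part of the disclosed architecture.

```python
from collections import Counter

# Hypothetical FU type names; a real CGRA instance defines its own set.
SCALAR, VECTOR, LOADSTORE = "scalar", "vector", "load/store"

def claim_for_thread(profile, pool):
    """Pick FU types from a free pool based on a thread's profile.

    profile: dict with 'scalar_ops' and 'vector_ops' operation counts.
    pool: Counter of free FUs per type; claimed FUs are removed from it.
    """
    want = Counter()
    if profile["vector_ops"] > profile["scalar_ops"]:
        # Vector-intensive thread: prefer vector and load/store FUs.
        want[VECTOR], want[LOADSTORE] = 2, 1
    else:
        # Mostly scalar thread: claim scalar FUs only.
        want[SCALAR] = 2
    # Never claim more than is currently free.
    claimed = Counter({t: min(n, pool[t]) for t, n in want.items()})
    pool.subtract(claimed)
    return claimed
```

A vector-heavy thread thus leaves the scalar FUs free for a concurrently running scalar thread, which is the point of heterogeneous claiming.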
In a microcomputer according to embodiments of the present disclosure, the processing unit adapted for allowing the threads of control to reconfigure at run-time the claiming of functional units may allow claiming a particular number of functional units depending on requirements of the application and the environment.
In a microcomputer according to embodiments of the present disclosure, a set of functional units, optionally register files, and memories may belong to a particular Dynamic Voltage and Frequency Scaling (DVFS) domain, and the voltage and frequency of this domain can be controlled independently of another domain. Hence when a processing unit claims a resource, it can also set the voltage and frequency of the appropriate domains it claims. Again, in accordance with embodiments of the present disclosure, the selection of a particular DVFS domain by a processing unit may be based on demand of the application and on current utilization of the CGRA. Different DVFS domains can be claimed by different threads. This means that different threads can simultaneously run, on a same CGRA, at different DVFS domains.
In a microcomputer according to any embodiments of the present disclosure, a set of functional units, optionally register files, and memories may belong to a particular adaptive body biasing (ABB) domain. Adaptive body biasing is a technique where the bias voltage of a selected part of a chip (domain) is adapted. A change in the bias voltage of the bulk of the domain implies that the threshold voltage of the transistors in that domain changes. This results in a change in performance. Based on the required increase or reduction in performance, an appropriate positive or negative voltage can be applied to reach the correct threshold voltage Vth of the PMOS transistors and the appropriate threshold voltage Vth of the NMOS transistors in the corresponding domain. In accordance with embodiments of the present disclosure, the body biasing of a particular domain can be controlled independently of the body biasing of another domain. Hence, when a processing unit claims a resource, it can also set the body biasing of the appropriate domains it claims. Again, in accordance with embodiments of the present disclosure, the selection of a particular body biasing domain by a processing unit may be based on demand of the application and on current utilization of the CGRA. Different body biasing domains can be claimed by different threads. This means that different threads can simultaneously run, on a same CGRA, at different body biasing domains.
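As an illustrative sketch only, per-domain scaling as described above can be modelled as follows: each claimed resource set maps to a DVFS domain whose operating point the claiming thread programs independently of the other domains. The domain names and operating points are hypothetical, not part of the disclosed architecture; body biasing could be modelled the same way with a per-domain bias voltage.

```python
# Hypothetical (supply voltage in V, frequency in MHz) operating points.
OPERATING_POINTS = {
    "low":  (0.7, 200),
    "high": (1.1, 600),
}

class DVFSDomain:
    def __init__(self, name):
        self.name = name
        self.voltage, self.freq = OPERATING_POINTS["low"]  # power-up default

    def set_point(self, point):
        # Independently scalable: changing this domain touches no other one.
        self.voltage, self.freq = OPERATING_POINTS[point]

def claim_domains(domains, demand):
    """Claim a set of domains for a thread and program each to `demand`."""
    for d in domains:
        d.set_point(demand)
    return domains
```

Two threads can then run simultaneously on the same array, one with its claimed domains at "high" and the other at "low".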
An overview of DVFS and ABB for adaptive workloads can be found in “Combined Dynamic Voltage Scaling and Adaptive Body Biasing for Lower Power Microprocessors under Dynamic Workloads,” Steven M. Martin, Krisztian Flautner, Trevor Mudge, David Blaauw, Proceedings of ICCAD 2002, incorporated herein by reference.
In a microcomputer according to embodiments of the present disclosure, a set of functional units, optionally register files, and memories may belong to a power domain which can be switched on and off independently of another domain. The power domains may be adapted to be power gated to go to a low leakage mode.
In accordance with embodiments of the present disclosure, the reconfiguration may enable run-time adaptation of a same application, where several versions of the same application represent a trade-off, e.g. a Pareto trade-off, between two parameters, e.g. energy and time.
In a microcomputer according to embodiments of the present disclosure, the processing unit may be adapted for supporting multi-stream capability.
A microcomputer according to embodiments of the present disclosure may be adapted for having the claimed functional units for one thread of control operate independently from the claimed functional units for another thread of control.
In a second aspect, the present disclosure provides a method for executing an application having multiple threads of control on a system comprising a heterogeneous coarse-grained reconfigurable array comprising a plurality of functional units. The method comprises the threads of control each claiming, by means of at least one processing unit, a different set of functional units to work for that thread; monitoring a run-time, e.g. current, situation of the system with respect to the occupation of the functional units; and, based on the occupation of the functional units and on application requirements, allowing the threads of control to claim, by means of the at least one processing unit, different functional units to work for that thread. This may include selecting a different version of precompiled software and loading this different version of the software to the configuration memory on the CGRA for execution.
A method according to embodiments of the present disclosure may furthermore comprise, when the run-time situation changes, selecting another precompiled version of the same application that better suits the needs of the current situation, the other precompiled version of the same application making use of at least one other type of functional unit.
In a method according to embodiments of the present disclosure, allowing the threads of control to claim different functional units to work for that thread may include claiming sets of functional units to work in an instruction level parallelism (ILP) fashion, a thread level parallelism (TLP) fashion, a data level parallelism (DLP) fashion or a mix of two or more of these fashions.
In a third aspect, the present disclosure provides a run-time engine adapted for monitoring a system comprising a heterogeneous coarse grained reconfigurable array comprising a plurality of functional units. The monitored system runs an application having multiple threads of control loaded on the CGRA for execution. The run-time engine is adapted for monitoring the system with respect to the current occupation of the functional units and application requirements, and based on the occupation of the functional units and on the application requirements, selecting a different pre-compiled version of the application, different pre-compiled versions of the application making use of at least one other type of functional units to work for a thread of control.
In a further aspect, the present disclosure provides a method for converting application code into execution code suitable for execution on a microcomputer as in any of the embodiments of the first aspect. The method comprises: obtaining application code, the application code comprising at least a first and a second thread of control, and converting at least part of said application code for the at least first and second thread of control, said converting including providing different versions of code for making use of different sets of resources, different sets of resources including different types of functional units, and insertion of selection information into each thread of control, the selection information being for selecting a different version of code, depending on requirements of the application and a particular occupation of the functional units.
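The conversion step described in this aspect can be illustrated, purely hypothetically, as a compiler pass that emits for each thread of control several code versions, each targeting a different resource set, plus selection information that the run-time uses to pick a version. The file-name scheme, function names and data structures below are illustrative only, not the disclosed tool-chain.

```python
def convert(thread_name, resource_sets):
    """Return a thread descriptor with one code version per resource set."""
    versions = [
        {
            "binary": f"{thread_name}_{'_'.join(sorted(rs))}.bin",  # placeholder name
            "needs": set(rs),
        }
        for rs in resource_sets
    ]

    def select(free_fus):
        # Selection information inserted into the thread: the first version
        # whose FU-type needs are all currently free; None if nothing fits.
        for v in versions:
            if v["needs"] <= set(free_fus):
                return v
        return None

    return {"thread": thread_name, "versions": versions, "select": select}
```

The preferred (e.g. vector) version is thus used when its FU types are free, with a scalar fallback version otherwise.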
In yet another aspect, the present disclosure also provides a method for executing an application on a microcomputer as defined in any of the embodiments of the first aspect. The method comprises executing the application on the microcomputer as at least two process threads on a first set of at least two non-overlapping processing units; depending on the current occupation of functional units in the first set of at least two non-overlapping processing units and on requirements of the application, dynamically switching the microcomputer into a second set of at least two non-overlapping processing units, the second set being different from the first set; and executing the at least two process threads of the application on the second set of at least two processing units.
A method for executing an application according to embodiments of the present disclosure may furthermore comprise controlling each processing unit by a separate memory controller.
Particular and preferred aspects of the disclosure are set out in the accompanying independent and dependent claims. Features from the dependent claims may be combined with features of the independent claims and with features of other dependent claims as appropriate and not merely as explicitly set out in the claims.
For purposes of summarizing the disclosure and the advantages achieved over the prior art, certain objects and advantages of the disclosure have been described herein above. Of course, it is to be understood that not necessarily all such objects or advantages may be achieved in accordance with any particular embodiment of the disclosure. Thus, for example, those skilled in the art will recognize that the disclosure may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other objects or advantages as may be taught or suggested herein.
The above and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
The disclosure will now be described further, by way of example, with reference to the accompanying drawings, in which:
The drawings are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn to scale for illustrative purposes. The dimensions and the relative dimensions do not necessarily correspond to actual reductions to practice of the disclosure.
Any reference signs in the claims shall not be construed as limiting the scope.
In the different drawings, the same reference signs refer to the same or analogous elements.
The present disclosure will be described with respect to particular embodiments and with reference to certain drawings but the disclosure is not limited thereto but only by the claims.
Furthermore, the terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequence, either temporally, spatially, in ranking or in any other manner. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the disclosure described herein are capable of operation in other sequences than described or illustrated herein.
Moreover, the terms top, under and the like in the description and the claims are used for descriptive purposes and not necessarily for describing relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the disclosure described herein are capable of operation in other orientations than described or illustrated herein.
It is to be noticed that the term “comprising,” used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. It is thus to be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, or groups thereof. Thus, the scope of the expression “a device comprising means A and B” should not be limited to devices consisting only of components A and B. It means that with respect to the present disclosure, the only relevant components of the device are A and B.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.
Similarly it should be appreciated that in the description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure, and form different embodiments, as would be understood by those in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
It should be noted that the use of particular terminology when describing certain features or aspects of the disclosure should not be taken to imply that the terminology is being re-defined herein to be restricted to include any specific characteristics of the features or aspects of the disclosure with which that terminology is associated.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
A microcomputer according to embodiments of the present disclosure is a CGRA architecture comprising two distinct parts for the datapath: VLIW parts and CGRA parts. In a microcomputer according to embodiments of the present disclosure, a very long instruction word (VLIW) digital signal processor (DSP) is combined with a 2-D coarse-grained heterogeneous reconfigurable array (CGRA), which is extended from the VLIW's datapath. VLIW architectures execute multiple instructions per cycle, packed into a single large “instruction word” or “packet,” and use simple, regular instruction sets. The VLIW DSP efficiently executes control-flow code by exploiting instruction-level parallelism (ILP) of one or more FUs. The array, containing many functional units, accelerates data-flow loops by exploiting high degrees of loop-level parallelism (LLP). The architecture template allows designers to specify the interconnection, the type and the number of functional units.
In the context of a microcomputer, a functional unit can be qualified by three aspects:
the width of the operands it can operate on;
the set of operations that can be performed;
the connection of the FU to other FUs.
If one or more of the above aspects of a functional unit changes, the FU is said to be of a different type. A change in one or more of the above aspects implies that the compiler has to find a completely new way of mapping code onto a “new set of FU types,” or the code has to be manually transformed to enable a new mapping onto the “new set of FU types.”
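The three qualifying aspects above can be modelled as a type triple: two FUs are of the same type exactly when all three aspects match. This is a minimal sketch with illustrative field names and values, not the disclosure's own representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FUType:
    operand_width: int        # width of the operands it can operate on (bits)
    operations: frozenset     # the set of operations it can perform
    connections: frozenset    # which other FUs it is connected to

# Two hypothetical ALUs that differ only in operand width — and are
# therefore different FU types, requiring a new code mapping.
scalar_alu = FUType(32, frozenset({"add", "sub", "cmp"}), frozenset({"fu1"}))
wider_alu = FUType(64, frozenset({"add", "sub", "cmp"}), frozenset({"fu1"}))
```

Because equality is derived from all three fields, a change in any one of them yields a distinct type, mirroring the rule stated above.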
The CGRA template according to embodiments of the present disclosure thus tightly couples a very long instruction word (VLIW) processor 21 and a coarse-grained array 22 by providing two functional modes on the same physical resources. It brings advantages such as high performance, low communication overhead and ease of programming. An application written in a programming language such as e.g. C can be quickly mapped onto a CGRA instance according to embodiments of the present disclosure.
The CGRA according to embodiments of the present disclosure is a flexible template instead of a concrete instance. An architecture description language is developed to specify different instances. A script-based technique allows a designer to easily generate different instances by specifying different values for the communication topology, supported operation set, resource allocation and/or timing of the target architecture. Together with a retargetable simulator and compiler, this tool-chain allows for architecture exploration and development of application domain specific processors. As CGRA instances according to embodiments of the present disclosure are defined using a template, the VLIW width, the array size, the interconnect topology, etc. can vary depending on the use case.
The CGRA template according to embodiments of the present disclosure includes many basic components, including computational, storage and routing resources. The CGRA part is an array of computational resources and storage interconnected in a pre-described way. The computational resources are functional units (FUs) 26, 27, 28 that are capable of executing a set of word-level operations selected by a control signal. The functional units 26, 27, 28 can be heterogeneous (in terms of instructions supported in one functional unit, SIMD size, connectivity to other functional units, etc.), or they can be homogeneous. They are connected in a pre-determined way by means of routing resources (not illustrated). Each functional unit can internally have many SIMD slots to operate on different data in parallel under a same instruction. The CGA array also comprises transition nodes or pipeline registers between the different functional units, as well as register files to store intermediate data. Each of the functional units and the interconnect can be configured at every cycle to execute another instruction. The CGRA functional units can be of many types, for example scalar, vector, pack/unpack, load/store, etc. The scalar units do not support high SIMD and are meant to operate on data with limited SIMD, such as address calculation or other such operations. The vector FUs support SIMD and can crunch data in parallel.
Data storages such as register files (RFs) 29, 30, 35 and memory blocks 31 can be used to store intermediate data. The routing resources (not illustrated) interconnect these data storages and the functional units.
Also the data memory can be of two types: scalar memory and vector memory. The vector memories can also be of different sizes, in depth and/or in width of the vector. The data memories may be connected directly to the FUs that support load/store, or may be connected to the FUs via a data memory queue (DMQ). The DMQ is used to hide bank-conflict latency in case many functional units try to access data from a same bank in parallel. Data memories can be local to a thread or globally shared across different threads.
The L2 instruction memory may also comprise two parts (one for CGA and one for VLIW instructions). Alternatively, it may comprise one part only (combined VLIW and CGA instructions). The L1 instruction memory comprises two parts: one for the VLIW instructions and one for the CGA instructions. The L1 instruction memory for the CGA is called “configuration memory.” There is a further level 0 or L0 instruction memory for the CGA, which is called “configuration cache.” The “configuration memory” comprises the instructions for one mode of the program (so several loops), while the “configuration cache” only comprises the instructions for one or two loops.
Each VLIW part is a multi-issue or a single-issue processor which can interface with the rest of the platform. The VLIW part is tuned for running scalar and control code; it is not meant for running heavy data-processing code. The VLIW processor 21 includes several FUs 32 and at least one multi-port register file 30, as in typical VLIW architectures, but in this case the VLIW processor 21 is also used as the first row of the reconfigurable array. Some FUs 32 of this first row are connected to the memory hierarchy 33, depending on the number of available ports. Data accesses to the memory of the unified architecture are done through load/store operations available on these FUs 32. When compiling applications for a microcomputer according to embodiments of the present disclosure, loops are modulo-scheduled for the CGA 22 and the remaining code is compiled for the VLIW 21. By seamlessly switching the microcomputer between the VLIW mode and the CGA mode at run-time, statically partitioned and scheduled applications can be run on the CGRA instance according to embodiments of the present disclosure with a high number of instructions per clock (IPC).
To remove the control flow inside loops, the FUs 26, 27, 28 support predicated operations. The results of the FUs can be written to data storages such as the distributed RFs 29, 35, i.e. RFs 29, 35 dedicated to a particular functional unit 26, 27, which RFs 29, 35 are small and have fewer ports than the shared data storage, such as register file 30, which is at least one global data storage shared between a plurality of functional units 26, 27, 28; or the results of the FUs 26, 27, 28 can be routed to other FUs 26, 27, 28. To guarantee timing, the outputs of the FUs 26, 27, 28 may be buffered by an output register. Multiplexers are part of the routing resources for interconnecting the FUs 26, 27, 28 into at least two non-overlapping processing units; they are used to route data from different sources.
The microcomputer 20 according to embodiments of the present disclosure comprises a plurality of memories. The first memory 31 is a memory with the same width as the scalar functional units 26, e.g. a 32-bit memory. The first memory 31 may comprise a plurality, e.g. 4 in the embodiment illustrated, of memory banks. This memory 31 is connected to a plurality of FUs 26 in the scalar datapath, e.g. 4 FUs 26 in the embodiment illustrated, as well as to the VLIW functional units 32. In addition, the CGRA instance according to embodiments of the present disclosure also comprises at least one, for example a plurality of, scratchpad memories 34, e.g. two scratchpad memories 34, each connected to only one FU 27 in the array. Therefore, no DMQ is needed by those two scratchpad memories 34, resulting in power and area savings. In order to still enable a high memory throughput, both memories 34 support only wide memory accesses, loading/storing vectors of, for example but not necessarily, the same width as the FUs 27 of the second part 24, e.g. 256 bits. Moreover, these vector loads and stores reduce the number of packing and unpacking instructions needed for the vector processing, resulting in a performance gain. The idea is that computation is kept highly parallel in the vector datapath, while the scalar datapath is used mainly for address computation or for parts of the application where highly parallel DLP cannot be used (e.g. tracking in WLAN).
A CGRA architecture may be split up into partitions. A partition is an arbitrary grouping of resources of any size: a partition can be a single FU, or it can comprise a plurality of FUs, RFs, memories, etc. Each partition can be viewed as a downscaled CGRA architecture and can optionally be partitioned further down the hierarchy. Each partition can simultaneously execute a programmer-defined thread (multi-threading).
Each thread has its own resource requirements. A thread that is easy to parallelize can exploit more computation resources, so executing it on a larger partition results in optimal use of the ADRES array, and vice versa. A globally optimal application design demands that the programmer know the IPC of each part of the application, so that an efficient array partition can be found for each thread.
One way to find out how many resources are required by each part of a certain application is profiling. A programmer starts from a single-threaded application and profiles it on a large single-threaded CGRA architecture. From the profiling results, kernels with low IPC are identified as high-priority candidates for threading. Depending on the resource demand of the threads, a programmer may statically plan how and when the CGRA should be split into partitions during application execution. When the threads are well organized, the full array can be optimally utilized.
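The profiling step above can be sketched as follows: kernels whose measured IPC on the large single-threaded instance falls below a threshold become high-priority candidates for threading on a smaller partition. The kernel names, IPC values and threshold here are hypothetical illustrations, not measured data.

```python
def threading_candidates(profile, ipc_threshold=4.0):
    """Return the names of low-IPC kernels, lowest IPC first.

    profile: dict mapping kernel name -> measured IPC on the full array.
    """
    low = [(ipc, name) for name, ipc in profile.items() if ipc < ipc_threshold]
    # Lowest IPC first: these kernels waste the most of a full-array slot.
    return [name for ipc, name in sorted(low)]
```

The programmer would then plan partitions so that these kernels run concurrently on small partitions while high-IPC kernels keep the full array.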
A thread is always started/stopped (in other words: operated) using a VLIW processor 21. Each VLIW processor 21 can start and stop a new thread independently of the others. When a VLIW processor 21 starts a thread, it claims a set of FUs 26, 27, 28 from the CGRA FUs, which can then operate in a synchronous fashion to execute the thread. Furthermore, a VLIW 21 can also spawn threads to other VLIWs. For example, VLIW1 spawns two threads, where each thread claims a set of mutually exclusive resources from the CGRA FUs and memories. Each of the two threads runs on, say, VLIW1 and VLIW2 respectively. This example is shown in
Furthermore, there can be another example (not illustrated) where the first VLIW, VLIW1, spawns two threads, and where two sets of CGA resources and data memory claims are made for the two threads. However, these threads run independently of each other, and there is a “join” after the two threads finish executing.
Threads may communicate with one another, either via a shared memory or via FIFO or other mechanisms.
Resources can be reserved at compile time, where the code of the VLIW processor defines the thread(s) and the resources required on the CGA. For example, a first VLIW processor can invoke one of two options: option 1, where code 1 is run on a CGA with X functional units and P memories, or option 2, where code 2, which is functionally the same or different, is run with Y functional units and Q memories. At run-time, the preferred option is selected based on the application requirements and the environment, i.e. the current usage of resources by other applications which are running.
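A minimal sketch of this run-time choice: each compile-time option records the number of functional units and memories its code version needs, and the run-time picks the first option that fits the currently free resources. All numbers and names below are hypothetical.

```python
def pick_option(options, free_fus, free_mems):
    """Pick the first compile-time option that fits the free resources.

    options: list of (name, fus_needed, mems_needed) tuples, in order of
    preference as fixed at compile time.
    """
    for name, fus, mems in options:
        if fus <= free_fus and mems <= free_mems:
            return name
    return None  # no option fits the current environment
```

With options ordered by preference, the richer version runs when the array is idle and the leaner fallback runs when other applications hold resources.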
In embodiments of the present disclosure, the CGA functional units may have different modes of operation:
As indicated with respect to
According to further embodiments of the present disclosure, a set of FUs and register files and memories may belong to a dynamic voltage and frequency scaling (DVFS) domain, and the voltage and frequency of this domain can be controlled independently of the voltage and frequency of another domain. A set of FUs, register files and memories also can belong to a power domain which can be switched on and off independently from another power domain as well. Therefore, in accordance with embodiments of the present disclosure, when a VLIW processor 40, 41 claims a set of resources, it can also set the voltage and frequency of the appropriate domains that it claims.
Based on the demand of the application to be executed and on the current utilization of the CGRA, as mentioned earlier, different resource combinations and modes can be claimed.
A microcomputer according to embodiments of the present disclosure can fully support run-time reconfiguration, multistream capability, and a combination of both. By multistream capability is meant that two asynchronous streams (e.g. LTE, Long Term Evolution, and WLAN, Wireless Local Area Network) run in parallel on the platform, e.g. in a master-master mode. By run-time reconfiguration is meant that the resources for a configured stream (e.g. LTE) can be reconfigured (e.g. to WLAN). This is linked to handover mechanisms. The reconfigurability can be internal or external, where external means re-loading the new standard to an L2 instruction memory, and internal means that, within the microcomputer according to embodiments of the present disclosure, the appropriate modulation and coding scheme (MCS) is loaded to an L1 instruction memory either via caching mechanisms (for the VLIW part) or via direct memory access (DMA) (for the CGA part).
The run-time reconfiguration also enables run-time adaptation of a same application, where several versions of the same application, representing a trade-off such as for example a Pareto trade-off (e.g. between energy and time), are available. Those different versions of the same application are compiled and kept in the higher levels of instruction memory. These may be different programs compiled for different allocations of resources, different DVFS settings, etc.
When a new application is started, a run-time engine 90 checks the current occupation of the functional units and, based on that occupation and on the requirements of the application, selects a suitable pre-compiled version of the application to be loaded for execution.
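The engine's selection among the Pareto versions kept in instruction memory can be sketched, under purely illustrative assumptions, as: among the versions that meet the application's time budget, pick the one with the lowest energy. The version names and numbers are hypothetical, not measured.

```python
def select_version(versions, deadline_ms):
    """Select the name of the lowest-energy version meeting the deadline.

    versions: list of (name, exec_time_ms, energy_mj) Pareto points for one
    application, as kept in the higher levels of instruction memory.
    """
    feasible = [v for v in versions if v[1] <= deadline_ms]
    if not feasible:
        return None  # no version meets the deadline
    return min(feasible, key=lambda v: v[2])[0]
```

A tight deadline forces the fast, energy-hungry version; a relaxed one lets the engine trade time for energy, which is the Pareto adaptation described above.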
While the disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. The foregoing description details certain embodiments of the disclosure. It will be appreciated, however, that no matter how detailed the foregoing appears in text, the disclosure may be practiced in many ways. The disclosure is not limited to the disclosed embodiments.
Number | Date | Country | Kind |
---|---|---|---|
11165893.6 | May 2011 | EP | regional |
Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are hereby incorporated by reference under 37 C.F.R. §1.57. This application is a continuation of PCT Application No. PCT/EP2012/058926, filed May 14, 2012, which claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application No. 61/507,957, filed Jul. 15, 2011. Each of the above applications is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
61507957 | Jul 2011 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2012/058926 | May 2012 | US |
Child | 14072584 | US |