The described embodiments relate to systems and methods of developing a system architecture, and in particular, relate to systems and methods of developing a system architecture based on a plurality of optimization parameters.
The design and development of systems requires extensive analysis and assessment of the design space, not only due to the assorted nature of design parameters, but also due to the diversity in architecture for implementation. Given specifications and system requirements, the aim of designers is to reduce a large and complex design space into a set of feasible design solutions meeting performance objectives and functionality.
For systems based on operational constraints, the selection of an optimal architecture for system design is an important step in the development process. The architecture design space can contain innumerable design options for selection and implementation based on the parameters of optimization. Selection of the optimal architecture from the design space that satisfies all the performance parameter objectives may be useful for the present generation of System-on-Chip (SoC) designs and Very Large Scale Integration (VLSI) designs. As it is possible to implement different functions of a system on different hardware components, the architecture design space becomes more complex to analyze. In the case of high level synthesis, performing design space exploration to choose a candidate architecture that concurrently satisfies many operating constraints and performance parameters is considered an important stage in the whole design flow. Since the design space is huge and complex, there exists a desire to efficiently explore candidate architectures for the system design based on the application to be executed. The method for exploring candidate architectures should not only have low complexity and run time but should also explore the variants efficiently while meeting the specifications provided. The process of high-level synthesis design is very complicated and descriptive and is usually performed by system architects. Depending on the application, the process of defining the problem, performing design space exploration and the other steps required for its successful accomplishment may be very time consuming. Furthermore, recent advancements in areas of communications and multimedia have led to the growth of a wide array of applications requiring huge data processing at minimal power expense. Such data hungry applications demand satisfactory performance with power efficient hardware solutions.
Hardware solutions should satisfy multiple contradictory performance parameters such as power consumption and time of execution, for example. Since the selection process for the best design architecture is complex, an efficient approach to explore the design space for selecting a design option is desirable.
In a first aspect, some embodiments provide a method of developing a system architecture comprising:
The system architecture may comprise a Register Transfer Level data path circuit. The system architecture may further comprise a Register Transfer Level control timing sequence. The Register Transfer Level data path circuit may be configured to generate output data as a result of performing a sequence of operations on data using Register Transfer Level modules, wherein the Register Transfer Level modules include the number of each kind of resources represented by the selected vector. The Register Transfer Level modules may be selected from the group consisting of registers for storage of data, memory modules, latches for sinking of data, multiplexers and demultiplexers.
The kinds of resources R1, . . . Rn may be selected from the group consisting of adders, subtractors, clock oscillators, multipliers, dividers, comparators, Arithmetic Logic Units (ALUs), integrators, summers and other functional modules.
The optimization parameters may be selected from the group consisting of hardware area, cost, time of execution, and power consumption.
The Register Transfer Level control timing sequence may provide a control configuration for a data path circuit to provide timing and synchronization required by data traversing through the Register Transfer Level modules of the data path circuit.
In accordance with some embodiments, a final optimization parameter is a hardware area of a total number of all kinds of resources R1, . . . Rn, and wherein, for the hardware area, the priority factor function of each kind of resource R1, . . . Rn is an indicator of a change of area contributed by a change in the number of the kind of resource Ri, wherein 1≦i≦n
For the hardware area, the priority factor for each kind of resource R1, . . . Rn that is not a clock oscillator may be calculated from NRi, ΔNRi, KRi wherein NRi is the number of the kind of resource Ri, KRi is an area occupied by the kind of resource Ri, ΔNRi·KRi is a change of area contributed by the kind of resource Ri, wherein Ri is a member of the kinds of resources R1, . . . Rn; and
For the hardware area, the priority factor for each kind of resource R1, . . . Rn that is not a clock oscillator may be of the form:
In accordance with some embodiments, the plurality of optimization parameters comprise a time of execution of a total number of all kinds of resources R1, . . . Rn, and wherein, for the time of execution, the priority factor function for each kind of resource R1, . . . Rn is a function of the rate of change of a cycle time with a change in the number NRi of the kind of resources Ri at a maximum clock period, wherein 1≦i≦n and Ri is a member of the kinds of resources R1, . . . Rn.
The priority factor function for the time of execution of the resources R1, . . . Rn that is not a clock oscillator may be calculated by NRi, TRi, Tpmax, wherein NRi is the number of the kind of resource Ri, TRi is a number of clock cycles required by the kind of resource Ri to finish each operation, Tp is the time period of the clock, Tpmax is the maximum clock period; and
The priority factor function for the time of execution of the resources R1, . . . Rn that is not a clock oscillator may be of the form:
In accordance with some embodiments, the plurality of optimization parameters comprise a power consumption of the resources R1, . . . Rn, and wherein, for the power consumption, the priority factor function for each kind of resource R1, . . . Rn is a function of a change in power consumption per unit area due to deviation of clock frequency from maximum to minimum and a change in the number NRi of the kind of resource Ri at maximum clock frequency, wherein 1≦i≦n, and Ri is a member of the kinds of resources R1, . . . Rn.
The priority factor function for the power consumption of the resources R1, . . . Rn that is not a clock oscillator may be calculated by NRi, KRi, ΔNRi, (pc)max, pc wherein NRi is the number of resource Ri, KRi is an area occupied by resource Ri, ΔNRi·KRi is a change of area contributed by resource Ri, pc is power consumed per area unit resource at a particular frequency of operation, (pc)max is power consumed per area unit resource at a maximum clock frequency; and
The priority factor function for the power consumption of the resources R1, . . . Rn that is not a clock oscillator may be of the form:
In accordance with some embodiments, the plurality of optimization parameters comprise a total cost of the total number of all kinds of resources R1, . . . Rn, and wherein, for the total cost, the priority factor function for each kind of resource R1, . . . Rn is an indicator of change in total cost of the total number of all kinds of resources R1, . . . Rn with respect to a change in the number of the kind of resource Ri and the cost per unit resource, wherein 1≦i≦n.
The priority factor function for each kind of the resources R1, . . . Rn that is not a clock oscillator may be calculated by NRi, KRi, ΔNRi, CRi, wherein NRi is the number of the kind of resource Ri, KRi is an area occupied by the kind of resource Ri, ΔNRi·KRi is a change of area contributed by the kind of resource Ri, CRi is the cost per area unit of the kind of resource Ri; and
For the cost, the priority factor function for each kind of the resources R1, . . . Rn that is not a clock oscillator may be of the form:
In accordance with some embodiments, the method may further comprise, for each optimization parameter:
In accordance with some embodiments, the method may further comprise, for each of the plurality of optimization parameters, determining whether the constraint value for the optimization parameter is valid by:
In accordance with some embodiments, the method may further comprise determining whether the set of vectors is valid by determining whether the set of vectors is null; and upon determining that the set of vectors is not valid, relaxing the constraint values for each optimization parameter by a predetermined percentage.
In accordance with some embodiments, the method may further comprise representing the combination of the number of resources R1, . . . Rn of the selected vector in a temporal and a spatial domain using a sequencing and binding graph and a plurality of registers.
In accordance with some embodiments, the method may further comprise determining a multiplexing scheme for the resources R1, . . . Rn of the selected vector, with inputs, outputs, operations, interconnections and time steps.
In accordance with some embodiments, the method may further comprise producing a Register Transfer Level data path circuit using the multiplexing scheme.
In accordance with some embodiments, the method may further comprise producing an integrated circuit using the system architecture.
In accordance with some embodiments, the set of vectors based on the intersection of the satisfying sets of vectors is a pareto set of vectors.
In another aspect, embodiments described herein provide a non-transitory computer-readable storage medium comprising instructions for execution on a computing device, wherein the instructions, when executed, perform acts of a method of developing a system architecture, wherein the method comprises:
In a further aspect, embodiments described herein provide a system of developing a system architecture comprising:
In another aspect, embodiments described herein provide a method of determining a vector representing a combination of a number of each kind of resource R1, . . . Rn available for constructing a system architecture comprising:
For a better understanding of embodiments of the systems and methods described herein, and to show more clearly how they may be carried into effect, reference will be made, by way of example, to the accompanying drawings in which:
The drawings, described below, are provided for purposes of illustration, and not of limitation, of the aspects and features of various examples of embodiments of the invention described herein. The drawings are not intended to limit the scope of the applicants' teachings in any way. For simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. The dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
It will be appreciated that numerous specific details are set forth in order to provide a thorough understanding of the exemplary embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Furthermore, this description is not to be considered as limiting the scope of the embodiments described herein in any way, but rather as merely describing implementation of the various embodiments described herein.
The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. However, these embodiments may be implemented in computer programs executing on programmable computers, each computer including at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), and at least one communication interface. For example, the programmable computers may be a server, network appliance, set-top box, embedded device, computer expansion module, personal computer, laptop, personal data assistant, or mobile device. Program code is applied to input data to perform the functions described herein and to generate output information. The output information is applied to one or more output devices, in known fashion. In some embodiments, the communication interface may be a network communication interface. In embodiments in which elements of the invention are combined, the communication interface may be a software communication interface, such as those for inter-process communication. In still other embodiments, there may be a combination of communication interfaces.
Each program may be implemented in a high level procedural or object oriented programming or scripting language, or both, to communicate with a computer system. However, alternatively the programs may be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Each such computer program may be stored on a storage media or a device (e.g. ROM or magnetic diskette), readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. Embodiments of the system may also be considered to be implemented as a non-transitory computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
Furthermore, the system, processes and methods of the described embodiments are capable of being distributed in a computer program product including a physical non-transitory computer readable medium that bears computer usable instructions for one or more processors. The medium may be provided in various forms, including one or more diskettes, compact disks, tapes, chips, magnetic and electronic storage media, and the like. The computer useable instructions may also be in various forms, including compiled and non-compiled code.
Embodiments described herein may provide Design Space Exploration (DSE) and a formalized High Level Synthesis (HLS) design flow with multi parametric optimization objective using the described DSE approach. The proposed approach may resolve issues related to DSE such as the precision of evaluation, time exhausted during evaluation and also automation of the exploration process. During DSE a conflicting situation may exist to concurrently maximize the accuracy of the exploration process and minimize the time spent during DSE analysis. Embodiments described herein may be capable of reducing the number of architectural variants to be analyzed for accurate selection of a design point. Embodiments described herein may involve determining the Priority Factor (PF) of the resources for final organization of the design space in increasing or decreasing order and may not require graphs or hierarchical tree arrangements to analyze the candidate variants. Embodiments described herein may be capable of simultaneously optimizing many performance parameters, such as for example time of execution, power consumption, hardware area and cost.
Embodiments described herein may provide an approach for finding design architecture with multi-parametric optimization objectives, which may be useful for accelerating design space exploration in HLS. Embodiments described herein may provide design steps of a multi-parametric optimized high level design flow useful for the generation of high data processing applications and complex SoC and VLSI design. Embodiments described herein may reduce the number of architectural variants to be analyzed for finding a candidate combination of a number of kinds of resources, and in accordance with some embodiments may be the pareto-optimized design point. Multi-parameter optimization and design space exploration in HLS design flow may allow automation of the proposed high level design for HLS tools. The architecture variant obtained by methods in accordance with embodiments described here may be used to implement integrated circuits such as SoC designs, field-programmable gate arrays (FPGA), Application Specific Integrated Circuits (ASICs), and so on.
Background on High Level Synthesis and Design Space Exploration
Interdependent tasks such as scheduling, allocation and module selection are important ingredients of the HLS design process. HLS is a methodology of transforming an algorithmic behavioral description into a Register Transfer Level (RTL) structure. The algorithmic description specifies the inputs and outputs of the behavior of the algorithm in terms of operations to be performed and data flow. A description of the algorithm may be represented in the form of an acyclic directed graph known as a sequencing graph. These graphs specify the input/output relation of the algorithm and the data dependency present in the data flow. The graph is defined in terms of its vertices and edges, where the vertices signify the operations and the edges indicate the data dependency present in the function. High level synthesis is therefore a conversion from the behavioral description to its respective hardware description in the form of memory elements, storage units, multiplexers/demultiplexers and the necessary interconnections. The RT level representation includes a control unit and the data path unit.
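As an illustration of the sequencing graph described above, the following minimal Python sketch stores a small behavior as a directed acyclic graph and derives one operation order that respects the data dependencies. The behavior y = (a + b) * (c - d), the operation names and the helper `dependency_order` are hypothetical and chosen only for illustration; they are not taken from the described embodiments.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical behavior: y = (a + b) * (c - d).
# Each key is an operation (vertex); its value lists the operations whose
# results it consumes (the incoming data-dependency edges).
sequencing_graph = {
    "add1": [],                 # a + b
    "sub1": [],                 # c - d
    "mul1": ["add1", "sub1"],   # (a + b) * (c - d)
}


def dependency_order(graph):
    """Return one ordering of operations that respects every data dependency."""
    return list(TopologicalSorter(graph).static_order())


order = dependency_order(sequencing_graph)
print(order)  # the multiplication always appears after both of its operands
```

A scheduler would additionally assign each operation to a time step and a hardware resource; the sketch covers only the dependency structure.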
For the present generation of VLSI technology with multi objective nature, the cost of solving the scheduling, allocation, and module selection by exhaustive search may be prohibitive. Multi-objective VLSI designs are used in low end ASICs with low power dissipation and acceptable performance as well as in high end ASICs with high performance requirements and satisfactory power expenditure. Hence, there is a desire for efficient design space exploration techniques to make efficient use of time due to time to market pressure, for example. Design space exploration is a procedure for analyzing the various design architectures in the design space to obtain an optimum, near optimum, or other candidate architecture for the behavioral description according to the predefined specifications. Design space exploration may be a challenge for researchers due to the heterogeneity of the objectives and parameters involved. An example of design space exploration is a multi-objective search problem where the optimization parameters may be hardware area, execution time, power consumption, cost, and so on. A trend in design space exploration is the reduction of the design space into a set of pareto optimal points by pareto optimal analysis. Sometimes even the pareto optimal set can be very large for analysis and selection of the design for system implementation. In order to assist in exploring the design space better, an accurate approach that is efficient in terms of time is desirable for high level synthesis design of embedded systems.
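The reduction of a design space to a set of pareto optimal points can be sketched as a dominance filter. The sketch below is a generic textbook filter, not the exploration method of the described embodiments; it assumes every parameter is to be minimized, and the design points, given as (area, time, power) tuples, are hypothetical values.

```python
def dominates(a, b):
    """True if design a is no worse than b in every parameter and strictly
    better in at least one (all parameters are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(points):
    """Keep only the designs that no other design dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical design points as (area, execution time, power) tuples.
designs = [(10, 50, 5), (12, 40, 6), (11, 55, 5), (12, 40, 7)]
front = pareto_set(designs)
print(front)
```

This filter compares every pair of designs, so its cost grows quadratically with the number of variants, which is one reason a large design space motivates a cheaper way of organizing the variants.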
The Proposed Design Space Exploration Framework
Embodiments described herein explore candidate micro-architectures from the architecture design space. Embodiments described herein may also serve as a backbone for high level synthesis design flow. In general, exploring design space can be a tedious and time consuming task for the designer. It demands great accuracy and elaborate analysis to determine the optimum design configuration. The exploration of the best design variant in a large design space within a short time and with less complexity is desirable. The amount of influence a resource can have on each parameter to be optimized during its change can be determined and the described theory used to explain a real example through high level synthesis.
Analysis for Hardware Area of the Resources
Let the area of the resources be given as ‘A’. Ri denotes the resources available for system designing, where 1<=i<=n.
Rclk refers to the clock oscillator used as a resource providing the necessary clock frequency to the system. The total area can be represented as the sum of all the resources used for designing the system. Hence total area is given in equation (1):
A=ΣA(Ri) (1)
Area can be expressed as the sum of the resources i.e. adder/subtractor, multiplier, divider etc and also the clock frequency oscillator. Therefore for a system with ‘n’ functional resources equation (1) can also be represented as shown in equation (2):
A=(NR1·KR1+NR2·KR2+ . . . +NRn·KRn)+A(Rclk) (2)
Where NRi represents the number of resource Ri and ‘KRi’ represents the area occupied per unit resource ‘Ri’ (1<=i<=n); Applying partial derivatives to equation (2) with respect to NR1, NR2 . . . NRn yields equation (3), equation (4) and equation (5) respectively as shown below:
According to the theory of approximation by differentials the change in the total area can be approximated by equation (6):
where symbol ‘Δ’ is called the delta operator.
Substituting equation (3), (4) and (5) into equation (6) yields equation (7) shown below:
dA=ΔNR1·KR1+ΔNR2·KR2+ . . . +ΔNRn·KRn+ΔA(Rclk) (7)
The equation above indicates the rate of change of area with respect to resource R1, R2 . . . Rn. Here the clock oscillator has been considered a resource which contributes to the area occupied by the hardware resources.
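The area model of equation (2) and the change in area of equation (7) can be checked numerically with a short Python sketch. The resource names, unit areas and clock oscillator area below are hypothetical values in arbitrary area units, and the function names are introduced only for this illustration.

```python
def total_area(counts, unit_area, clock_area):
    """Equation (2): sum of N_Ri * K_Ri over all resources, plus A(Rclk)."""
    return sum(counts[r] * unit_area[r] for r in counts) + clock_area


def area_change(delta_counts, unit_area, delta_clock_area=0):
    """Equation (7): sum of dN_Ri * K_Ri over all resources, plus dA(Rclk)."""
    return sum(delta_counts[r] * unit_area[r] for r in delta_counts) + delta_clock_area


# Hypothetical values.
unit_area = {"adder": 20, "multiplier": 100}
clock_area = 4
before = {"adder": 1, "multiplier": 1}
after = {"adder": 3, "multiplier": 2}
delta = {r: after[r] - before[r] for r in before}

estimated = area_change(delta, unit_area)
exact = total_area(after, unit_area, clock_area) - total_area(before, unit_area, clock_area)
print(estimated, exact)
```

Because equation (2) is linear in each NRi, the differential in equation (7) reproduces the change in area exactly here rather than approximately.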
The term Priority Factor (PF) will be used herein when exploring the design space in the proposed approach. The PF is a determining factor which helps judge the influence of a particular resource on the variation of the optimization parameters such as area, time of execution, power consumption, and so on. This PF will be used later to organize the architecture design space consisting of variants in increasing or decreasing order of magnitude. An example priority factor for area of the resource R1, R2 . . . Rn may be given as:
The factor defined above determines how the variation in area is affected by the change of number of that certain resource. Hence, the PF is the rate of change of area with respect to the change in number of resources.
Analysis for Time of Execution
For a system with ‘n’ functional resources the time of execution can be represented by the following formula:
Texe=[L+(N−1)·Tc] (13)
where ‘L’ represents latency of execution, ‘Tc’ represents the cycle time of execution, ‘N’ denotes the number of data elements to be processed.
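Equation (13) can be evaluated directly. The following Python sketch uses hypothetical values for the latency and cycle time (in arbitrary time units) to show how the share of the latency in the total time of execution shrinks as the number of data elements grows; the function name is introduced only for this illustration.

```python
def execution_time(latency, cycle_time, n_data):
    """Equation (13): Texe = L + (N - 1) * Tc."""
    return latency + (n_data - 1) * cycle_time


# Hypothetical values in arbitrary time units.
L, Tc = 8, 4
for N in (1, 10, 1000):
    t_exe = execution_time(L, Tc, N)
    print(N, t_exe, L / t_exe)  # last column: fraction of Texe due to latency

t_large = execution_time(L, Tc, 1000)
latency_share = L / t_large
```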
Since the number of data elements to be processed is large for real life applications, ‘L’ can be ignored and cycle time (Tc) becomes a primary factor. The maximum cycle time with one operation in each time slot can be represented by equation (14):
Tc=(NR1·TR1+NR2·TR2+ . . . +NRn·TRn)·Tp (14)
‘NRi’ represents the number of resource ‘Ri’, ‘TRi’ represents the number of clock cycles needed by resource ‘Ri’ (1<=i<=n) to finish each operation, and ‘Tp’ is the time period of the clock. From the theory of approximation of differentials the change in the total cycle time can be approximated as in equation (15).
Applying partial derivatives to equation (14) with respect to NR1, NR2 . . . NRn and Tp will produce the following set of equations:
Substituting equations (16), (17), (18) and (20) into equation (15) yields equation (21) below:
dTc=ΔNR1·TR1·Tp+ΔNR2·TR2·Tp+ . . . +ΔNRn·TRn·Tp+ΔTp·(TR1·NR1+TR2·NR2+ . . . +TRn·NRn) (21)
Equation (21) represents the change in total cycle time with the change in the number of resources and the clock period (clock frequency).
ΔNR1·TR1·Tp=The change of ‘Tc’ caused by the change in the number of resource R1; similarly,
ΔNRn·TRn·Tp=The change of ‘Tc’ caused by the change in the number of resource Rn.
Finally, ΔTp·(TR1·NR1+TR2·NR2+ . . . +TRn·NRn)=The change of ‘Tc’ caused by the change in clock period (clock frequency) and the change in the number of all resources available.
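The terms of equation (21) can be computed with a short Python sketch and compared with the exact change in the cycle time of equation (14). The resource names, cycle counts and clock periods are hypothetical values, and the function names are introduced only for this illustration.

```python
def cycle_time(counts, cycles, clock_period):
    """Equation (14): (sum of N_Ri * T_Ri) * Tp."""
    return sum(counts[r] * cycles[r] for r in counts) * clock_period


def cycle_time_change(counts, delta_counts, cycles, clock_period, delta_period):
    """Equation (21): first-order change in the cycle time."""
    resource_terms = sum(delta_counts[r] * cycles[r] * clock_period for r in counts)
    clock_term = delta_period * sum(cycles[r] * counts[r] for r in counts)
    return resource_terms + clock_term


# Hypothetical values: clock cycles per operation and clock period.
cycles = {"adder": 2, "multiplier": 4}
counts = {"adder": 1, "multiplier": 1}
delta = {"adder": 1, "multiplier": 0}
new_counts = {r: counts[r] + delta[r] for r in counts}
Tp = 10

# Only the number of adders changes: the estimate matches the exact change.
est_counts_only = cycle_time_change(counts, delta, cycles, Tp, 0)
exact_counts_only = cycle_time(new_counts, cycles, Tp) - cycle_time(counts, cycles, Tp)

# The clock period changes as well: the estimate omits the second-order
# term (dN_Ri * T_Ri * dTp), so it differs slightly from the exact change.
est_both = cycle_time_change(counts, delta, cycles, Tp, 1)
exact_both = cycle_time(new_counts, cycles, Tp + 1) - cycle_time(counts, cycles, Tp)
print(est_counts_only, exact_counts_only, est_both, exact_both)
```

The gap between the estimate and the exact change when both quantities vary is the expected behavior of an approximation by differentials.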
The priority factor (PF) can be defined for the ‘time of execution’ parameter. An example PF time of execution for the resource R1, R2 . . . Rn is given as:
The factors defined above indicate the rate of change of cycle time (Tc) with the change in number of resources at minimum clock frequency. For example, equation (22) indicates the rate of change of cycle time with a change in the number of that particular resource (e.g. change in number of adders/subtractors from one to three adders/subtractors) at minimum clock frequency.
Minimum clock frequency is considered because the clock period is the maximum at this frequency. Hence, the change in the number of a specific resource at maximum clock period will influence the change in the cycle time the most, compared to the change in cycle time at other clock periods. The PF will yield a real number, which will suggest the extent to which the change in number of that particular resource contributes to the change in cycle time.
Analysis for Power Consumption
For a system with ‘n’ functional resources the total power consumption (P) of the resources in a system can be represented by the following equation (26):
‘NRi’ represents the number of resource Ri as mentioned before. ‘KRi’ represents the area occupied per unit resource Ri and ‘pc’ denotes the power consumed per area unit resource at a particular frequency of operation.
Using the theory of approximation of differentials the change in power consumption can be formulated as shown in equation (28):
Applying partial derivative to equation (27) will produce the following equations:
Substituting equations (29), (30), (31) and (33) into equation (28) yields equation (34) below:
dP=(ΔNR1·KR1·pc+ΔNR2·KR2·pc+ . . . +ΔNRn·KRn·pc)+Δpc·(KR1·NR1+KR2·NR2+ . . . +KRn·NRn) (34)
Equation (34) represents the change in total power consumption with the change in the number of all resources and the clock period (clock frequency).
ΔNR1·KR1·pc=The change of ‘P’ contributed by the change in the number of resource R1;
Similarly, ΔNRn·KRn·pc=The change of P contributed by the change in the number of resource Rn;
Finally, Δpc·(KR1·NR1+KR2·NR2+ . . . +KRn·NRn)=The change of ‘P’ contributed by the change in clock period (clock frequency) and the change in the number of all resources available.
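The terms of equation (34) can be evaluated with a short Python sketch. The sketch assumes that the total power has the form P = (NR1·KR1 + . . . + NRn·KRn)·pc, which is the form consistent with equation (34) and with the definitions of ‘KRi’ and ‘pc’ above; the resource names and numeric values are hypothetical, and the function names are introduced only for this illustration.

```python
def total_power(counts, unit_area, pc):
    """Assumed power model: (sum of N_Ri * K_Ri) * pc."""
    return sum(counts[r] * unit_area[r] for r in counts) * pc


def power_change(counts, delta_counts, unit_area, pc, delta_pc):
    """Equation (34): first-order change in total power consumption."""
    resource_terms = sum(delta_counts[r] * unit_area[r] * pc for r in counts)
    frequency_term = delta_pc * sum(unit_area[r] * counts[r] for r in counts)
    return resource_terms + frequency_term


# Hypothetical values: unit areas and power per area unit at one frequency.
unit_area = {"adder": 20, "multiplier": 100}
counts = {"adder": 1, "multiplier": 1}
delta = {"adder": 2, "multiplier": 0}   # adders go from one to three
new_counts = {r: counts[r] + delta[r] for r in counts}
pc = 2

estimated = power_change(counts, delta, unit_area, pc, 0)
exact = total_power(new_counts, unit_area, pc) - total_power(counts, unit_area, pc)
print(estimated, exact)
```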
The example PFs defined in equations (35) to (37) indicate the rate of change in the total power consumption with the change in number of resources at maximum clock frequency. For example, equation (35) indicates the rate of change of total power consumption of the system with the change in the number of that particular resource (e.g. change in number of adders from one to three) at maximum clock frequency. The PF will help arrange the architectural variants of the design space in increasing or decreasing order of magnitude depending on the parameter of optimization. This would further facilitate the selection of the optimal design point that satisfies, or nearly satisfies, all operating constraints and optimization requirements specified. Examples of nearly satisfying the constraints would be within a 5-10% threshold or other reasonable, acceptable amount. If the vectors of the satisfying set obey the constraint values, or exceed a constraint value by no more than a predetermined acceptable percentage or amount, then the vectors are said to satisfy or nearly satisfy the constraint, respectively.
In the above equations, the maximum clock frequency was considered because the total power consumption is at the maximum at this frequency. Hence, the change in the number of a specific resource at maximum clock frequency will influence the change in the total power consumption (P) the most, compared to the change at other clock frequencies. The PF will yield a real number, which will suggest the extent to which the change in number of that particular resource contributes to the change in total power consumption for the system.
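The distinction between satisfying and nearly satisfying a constraint, described above, can be sketched as a simple check. The sketch assumes that the constraint is an upper bound on a parameter to be minimized and uses a 5% tolerance as one example of an acceptable threshold; the function name `classify` and the numeric values are hypothetical.

```python
def classify(value, constraint, tolerance=0.05):
    """Classify a design's parameter value against an upper-bound constraint.

    tolerance is the acceptable fractional excess (assumed 5% here).
    """
    if value <= constraint:
        return "satisfies"
    if value <= constraint * (1 + tolerance):
        return "nearly satisfies"
    return "violates"


# Hypothetical power values checked against a constraint of 100 units.
results = [classify(v, 100) for v in (95, 104, 120)]
print(results)
```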
The PF is arranged in such a way that the resource with the minimum PF is chosen first, gradually increasing and then ending at the resource with the highest priority factor. The above rule may apply for all optimization parameters.
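The ordering rule above amounts to sorting the kinds of resources by priority factor in increasing order. In the Python sketch below the priority factor values are hypothetical placeholders rather than values computed from the PF equations, and the helper `order_by_priority` is introduced only for this illustration.

```python
def order_by_priority(priority_factors):
    """Return resource names from lowest to highest priority factor."""
    return sorted(priority_factors, key=priority_factors.get)


# Hypothetical PF values for one optimization parameter.
priority_factors = {"adder": 0.8, "multiplier": 2.5, "comparator": 0.3}
ordering = order_by_priority(priority_factors)
print(ordering)
```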
Analysis for Hardware Cost of Resources (Including Cost of Intermediate Memory as Storage Register During Scheduling)
Another example optimization parameter is the hardware cost of resources. Let the area of the resources be given as ‘A’. Ri denotes the resources available for system designing, where 1<=i<=n. ‘n’ represents the maximum resource available for designing. ‘Rclk’ refers to the clock oscillator used as a resource providing the necessary clock frequency to the system. The total area can be represented as the sum of all the resources used for designing the system, such as adder, multiplier, divider, clock frequency oscillator and the memory elements. Hence total area can be given as shown in equation (1c).
A=ΣA(Ri) (1c)
A=(NR1·KR1+NR2·KR2+ . . . +NRn·KRn)+A(Rclk)+NRM·KRM (2c)
Where ‘NRi’ represents the number of resource ‘Ri’, ‘KRi’ represents the area occupied per unit resource ‘Ri’, ‘NRM’ represents the number of memory elements present (such as registers) and ‘KRM’ represents the area occupied by each memory element. Let the total cost of all resources in the system be ‘CR’. Further, the cost per area unit of the resource (such as adders, multipliers etc) is given as ‘CRi’, the cost per area unit of the clock oscillator is ‘CRclk’ and finally the cost per area unit of memory element is ‘CRM’. Therefore the total cost of the resources is given as:
CR=(NR1·KR1+NR2·KR2+ . . . +NRn·KRn)·CRi+A(Rclk)·CRclk+NRM·KRM·CRM (3c)
Applying partial derivative to equation (3c) with respect to NR1 . . . NRn, with respect to NRM and with respect to ARclk yields equation (4c) to (7c) respectively as shown below:
Now using the theory of approximation by differentials, the change in the total cost can be approximated by the following equation:
Substituting equations (4c) to (7c) into equation (8c) yields equation (9c) shown below.
Equation (9c) represents the change in total cost of resources with the change in the number of all resources and the clock period (clock frequency). The Priority Factor (PF) for cost of resources is defined as:
The Priority Factor (PF) yields a real number, which suggests the extent to which the change in number of a particular resource contributes to the change in hardware cost. The PF is a determining factor which helps us to judge the influence of a particular resource on the variation of the optimization parameters like area, time of execution and power consumption. Equations (10c) and (11c) indicate the change of cost with respect to change in resource R1, . . . Rn. Similarly, equation (12c) indicates the change of cost of the system with respect to change in number of resource ‘RM’. Further, equation (13c) indicates the change of cost of the system with respect to change in resource ‘Rclk’.
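The cost model of equation (3c) and the corresponding change in cost can be illustrated with a short Python sketch. The sketch assumes a separate cost per area unit for each kind of functional resource, which is one reading of ‘CRi’; all resource names, areas and costs are hypothetical values, and the function names are introduced only for this illustration.

```python
def total_cost(counts, unit_area, unit_cost, clock_area, clock_cost,
               n_mem, mem_area, mem_cost):
    """Cost of functional resources, clock oscillator and memory elements."""
    functional = sum(counts[r] * unit_area[r] * unit_cost[r] for r in counts)
    return functional + clock_area * clock_cost + n_mem * mem_area * mem_cost


def cost_change(delta_counts, unit_area, unit_cost, delta_mem, mem_area, mem_cost,
                delta_clock_area=0, clock_cost=0):
    """First-order change in cost from changes in the resource counts,
    the number of memory elements and the clock oscillator area."""
    functional = sum(delta_counts[r] * unit_area[r] * unit_cost[r]
                     for r in delta_counts)
    return functional + delta_mem * mem_area * mem_cost + delta_clock_area * clock_cost


# Hypothetical values.
unit_area = {"adder": 20, "multiplier": 100}
unit_cost = {"adder": 1, "multiplier": 2}
before = {"adder": 1, "multiplier": 1}
delta = {"adder": 0, "multiplier": 1}   # one extra multiplier
after = {r: before[r] + delta[r] for r in before}

cost_before = total_cost(before, unit_area, unit_cost, 4, 3, 2, 5, 1)
cost_after = total_cost(after, unit_area, unit_cost, 4, 3, 3, 5, 1)  # one extra register
estimated = cost_change(delta, unit_area, unit_cost, 1, 5, 1)
print(cost_before, cost_after, estimated)
```

As with area, the cost model is linear in each count, so the estimated change equals the exact difference when only the counts change.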
Design Flow
Embodiments described herein use the priority factor to organize the design space in increasing or decreasing order. Embodiments described herein may be directed to a design flow starting with the real specification and formulation, and eventually obtaining the register transfer level structure performing design space exploration. As an illustrative example, three parameters will be optimized during the following demonstration of design flow for high level synthesis; however, more than three parameters may be optimized and different combinations of parameters may be optimized. This illustrative example will be based on the following optimization parameters: power consumption, time of execution and hardware area of the resources.
Reference is first made to
System 10 may be implemented using a server which includes a memory store, such as database(s) or file system(s), or using multiple servers or groups of servers distributed over a wide geographic area and connected via a network. System 10 has a network interface for connecting to a network in order to communicate with other components, to serve web pages, and to perform other computing applications. System 10 may reside on any networked computing device including a processor and memory, such as an electronic reading device, a personal computer, workstation, server, portable computer, mobile device, personal digital assistant, laptop, smart phone, WAP phone, an interactive television, video display terminals, gaming consoles, and portable electronic devices or a combination of these. System 10 may include a microprocessor that may be any type of processor, such as, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a programmable read-only memory (PROM), or any combination thereof. System 10 may include any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), or the like. System 10 may include one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, and may also include one or more output devices such as a display screen and a speaker.
System 10 has a network interface in order to communicate with other components by connecting to any network(s) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.
Resource constraint module 12 is operable to define a plurality of resource constraints maxNR1, . . . maxNRn. Each resource constraint corresponds to a maximum number NRi, 1≦i≦n, of each kind of resource R1, . . . Rn available to construct the system architecture, where n is an integer greater than 1. Examples of resources include adders, subtractors, clock oscillators, multipliers, dividers, comparators, Arithmetic Logic Units (ALU), integrators, summers and other functional modules. An example of a resource constraint is a maximum amount of each type of resource available to construct the system architecture.
Optimization parameter constraint module 14 is operable to define a constraint value for each of at least three optimization parameters for the system architecture. The at least three optimization parameters comprise a final optimization parameter. Examples of optimization parameters include hardware area, cost, time of execution, and power consumption. Other optimization parameters may also be considered.
Design space module 16 is operable to define a design space as a plurality of vectors representing different combinations of a number of each kind of resource R1, . . . Rn available to construct the system architecture. Each vector of the design space is of the form:
Vn=(NR1, . . . NRn)
wherein NR1 represents the number of the kind of resource R1, NRn represents the number of the kind of resource Rn; and wherein, based on the resource constraints, 1≦NR1≦maxNR1, . . . 1≦NRn≦maxNRn, wherein maxNR1 is a maximum number of the kind of resource R1, . . . maxNRn is a maximum number of the kind of resource Rn.
For each of the plurality of optimization parameters, priority factor module 18 is operable to define a priority factor function for each kind of resource R1, . . . Rn. A priority factor function defines a rate of change of the optimization parameter with respect to a change in a number NRi of the corresponding kind of resource Ri, 1≦i≦n. Examples of priority factor functions are illustrated herein in relation to hardware area, execution time, cost, and power consumption. Other priority factor functions may also be used by system 10 for these optimization parameters, and for other optimization parameters.
Satisfying set module 20 is operable to determine a plurality of satisfying, or near satisfying, sets of vectors. Satisfying set module 20 is operable to determine the satisfying sets of vectors, by, for each of the optimization parameters except for the final optimization parameter:
Intersection module 22 is operable to determine a set of vectors based on an intersection of the plurality of satisfying sets of vectors for the optimization parameters.
Selection module 24 is operable to select a vector for use in developing the system architecture by, for the final optimization parameter:
System architecture module 26 is operable to develop the system architecture using the selected vector.
Referring now to
Problem Formulation and Technical Specifications
At step 102, system 10 defines a plurality of resource constraints maxNR1, . . . maxNRn. Each resource constraint corresponds to a maximum number NRi, 1≦i≦n, of each kind of resource R1, . . . Rn available to construct the system architecture, where n is an integer greater than 1.
This stage marks the beginning of the high level synthesis design flow starting with the problem description and technical specifications. The application may be properly defined with its associated data structure. These specifications will act as the input information for the high level synthesis tools. As an illustrative example, the following are example resource constraints:
Maximum resources available for the system design:
The following specifications are also assumed as an example for each resource available for constructing the system architecture:
At step 104, system 10 defines a constraint value for each of at least three optimization parameters for the system architecture. As an illustrative example, the following are example constraint values for the optimization parameters:
The at least three optimization parameters comprise a final optimization parameter. In this example, the final optimization parameter is hardware area, with a constraint value being a minimum value while satisfying the other constraints for the other optimization parameters. The final optimization parameter provides a frame of reference to evaluate the set of vectors in order to select a vector for developing the system architecture, as will be explained herein.
During the problem formulation stage for high level synthesis the mathematical model of the application may be used to define the behavior of the algorithm. The model suggests the input/output relation of the system and the data dependency present in the function. For this illustrative example, a transfer function of an IIR Butterworth filter is used to demonstrate the high level synthesis design flow. The choice of IIR Butterworth filter is arbitrary and any other filter can also be used. The selected filter is used as an example benchmark application. The transfer function of a second order IIR digital Butterworth filter function can be given as:
Where H (z) denotes the transfer function of the filter in the frequency domain and x(n), x(n−1), x(n−2), x(n−3) represent the input variables for the filter in time domain, y(n) and y(n−2) represent the present output of the filter and the previous output of the filter in the time domain, and ‘z’ represents the unit delay operator. For simplicity in explanation, constants 0.167, 0.5 and 0.33 are denoted with ‘A’, ‘B’ and ‘C’ respectively.
In accordance with embodiments described herein, system 10 is operable to validate the constraint values for the optimization parameters.
System 10 performs this validation step as a first screening level of check by performing a Minimum-Maximum evaluation for the constraint to verify whether the constraints specified are valid and feasible.
System 10 is operable to perform this validation using the following example inputs: Module Library, Data Flow Graph (or Mathematical function) of the application and constraints values. System 10 is operable to produce the following output: the decision whether the design process continues or terminates (i.e. constraints are valid or invalid). System 10 is operable to perform the validation according to the following algorithm:
Creation of a Random Architecture Design Space for Power Consumption Parameter
At step 106, system 10 defines a design space as a plurality of vectors representing different combinations of a number of each kind of resource R1, . . . Rn available to construct the system architecture, wherein each vector Vn of the design space is of the form:
Vn=(NR1, . . . NRn)
wherein NR1 represents the number of the kind of resource R1, NRn represents the number of the kind of resource Rn; and wherein, based on the resource constraints, 1≦NR1≦maxNR1, . . . 1≦NRn≦maxNRn, wherein maxNR1 is a maximum number of the kind of resource R1, . . . maxNRn is a maximum number of the kind of resource Rn.
This initial arrangement can be made in any order and is used by system 10 to visualize the total architectural variants available. The design space can change based on the resources available to construct the system architecture. The design space is first created according to the resource constraints for the total resources available to construct the system architecture.
For this illustrative example, the variable Vn=(NR1, NR2, NR3) is used to represent the architecture design space. The variables NR1, NR2 and NR3 indicate the number of adders/subtractors, multipliers and clock frequencies. According to the resource constraints, 1<=NR1<=3, 1<=NR2<=4 and 1<=NR3<=2. Table 1 shows the design space represented as different combinations of available resources, which are adder/subtractor, multiplier and clock for this example.
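As a minimal sketch of how such a design space could be enumerated, the following Python fragment builds every vector Vn=(NR1, NR2, NR3) permitted by the example resource constraints. The function name and the enumeration order are assumptions made for illustration only; because the initial arrangement can be made in any order, the output is not intended to reproduce the ordering of Table 1.

```python
from itertools import product


def enumerate_design_space(max_counts):
    """Return every vector (NR1, ..., NRn) with 1 <= NRi <= maxNRi.

    max_counts holds the resource constraints maxNR1..maxNRn.
    The order of the returned vectors is arbitrary (illustrative only).
    """
    ranges = [range(1, m + 1) for m in max_counts]
    return list(product(*ranges))


# Example constraints from the text: 1<=NR1<=3, 1<=NR2<=4, 1<=NR3<=2.
design_space = enumerate_design_space((3, 4, 2))
print(len(design_space))  # 3 * 4 * 2 = 24 candidate vectors
```

The size of the design space is the product of the resource constraints, which is why exhaustive evaluation quickly becomes impractical as more kinds of resources are added.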
Calculation of the Priority Factor for Each Available Resource
At step 108, for each of the plurality of optimization parameters, system 10 defines a priority factor function for each kind of resource R1, . . . Rn. A priority factor function defines a rate of change of the optimization parameter with respect to a change in a number NRi of the corresponding kind of resource Ri, 1≦i≦n. As described herein the priority factors may be defined by applying a partial derivative to an equation representing an optimization parameter.
An optimization parameter may be a hardware area of a total number of all kinds of resources R1, . . . Rn. For the hardware area, the priority factor function of each kind of resource R1, . . . Rn is an indicator of a change of area contributed by a change in the number of the kind of resource Ri, wherein 1≦i≦n. For the hardware area, the priority factor for each kind of resource R1, . . . Rn that is not a clock oscillator may be calculated from NRi, ΔNRi, KRi wherein NRi is the number of the kind of resource Ri, KRi is an area occupied by the kind of resource Ri, ΔNRi·KRi is a change of area contributed by the kind of resource Ri, wherein Ri is a member of the kinds of resources R1, . . . Rn. As an example, for the hardware area, the priority factor for each kind of resource R1, . . . Rn that is not a clock oscillator may be of the form:
For the hardware area, the priority factor function of resource Ri that is a clock oscillator may be calculated from ΔA(Rclk), NRclk, Rclk, wherein Rclk is a clock oscillator used to construct the system architecture, ΔA(Rclk) is a change of area occupied by clock oscillators, NRclk is a number of clock oscillators. As an example, for the hardware area, the priority factor function of resource Ri that is a clock oscillator may be of the form:
Another optimization parameter may be a time of execution of a total number of all kinds of resources R1, . . . Rn. For the time of execution, the priority factor function for each kind of resource R1, . . . Rn may be a function of the rate of change of a cycle time with a change in the number NRi of the kind of resource Ri at a maximum clock period, wherein 1≦i≦n and Ri is a member of the kinds of resources R1, . . . Rn. The priority factor function for the time of execution of each of the resources R1, . . . Rn that is not a clock oscillator may be calculated from NRi, TRi, Tpmax, wherein NRi is the number of the kind of resource Ri, TRi is a number of clock cycles required by the kind of resource Ri to finish each operation, Tp is the time period of the clock, and Tpmax is the maximum clock period. As an example, the priority factor function for the time of execution of each of the resources R1, . . . Rn that is not a clock oscillator may be of the form:
For the time of execution, the priority factor function of resource Ri that is a clock oscillator may be calculated from Rclk, NRi, TRi, NRclk, where Rclk is a clock oscillator used to provide the necessary clock frequency to the system, NRi is the number of the kind of resource Ri, NRclk is the number of clock oscillators, and TRi is a number of clock cycles required by the kind of resource Ri to finish each operation. As an example, for the time of execution, the priority factor function of resource Ri that is a clock oscillator may be of the form:
Another example optimization parameter is a power consumption of the resources R1, . . . Rn. For the power consumption, the priority factor function for each kind of resource R1, . . . Rn may be a function of a change in power consumption per unit area due to deviation of clock frequency from maximum to minimum and a change in the number NRi of the kind of resource Ri at maximum clock frequency, wherein 1≦i≦n, and Ri is a member of the kinds of resources R1, . . . Rn. The priority factor function for the power consumption of each of the resources R1, . . . Rn that is not a clock oscillator may be calculated from NRi, KRi, ΔNRi, (pc)max, pc, wherein NRi is the number of resource Ri, KRi is an area occupied by resource Ri, ΔNRi·KRi is a change of area contributed by resource Ri, pc is power consumed per area unit of resource at a particular frequency of operation, and (pc)max is power consumed per area unit of resource at a maximum clock frequency. As an example, the priority factor function for the power consumption of each of the resources R1, . . . Rn that is not a clock oscillator may be of the form:
For the power consumption, the priority factor function of resource Ri that is a clock oscillator may be calculated from NRi, TRi, Rclk, NRclk, pc, where Rclk is a clock oscillator used to provide the necessary clock frequency to the system, NRi is the number of the kind of resource Ri, TRi is a number of clock cycles required by resource Ri to finish each operation, and pc is power consumed per area unit of resource at a particular frequency of operation. As an example, for the power consumption, the priority factor function of resource Ri that is a clock oscillator may be of the form:
As another example, an optimization parameter may be a total cost of the total number of all kinds of resources R1, . . . Rn. For the total cost, the priority factor function for each kind of resource R1, . . . Rn may be an indicator of change in total cost of the total number of all kinds of resources R1, . . . Rn with respect to a change in the number of the kind of resource Ri and the cost per unit resource, wherein 1≦i≦n. For the cost, the priority factor function for each kind of the resources R1, . . . Rn that is not a clock oscillator may be calculated from NRi, KRi, ΔNRi, CRi, wherein NRi is the number of the kind of resource Ri, KRi is an area occupied by the kind of resource Ri, ΔNRi·KRi is a change of area contributed by the kind of resource Ri, and CRi is the cost per area unit of the kind of resource Ri. As an example, for the cost, the priority factor function for each kind of the resources R1, . . . Rn that is not a clock oscillator may be of the form:
For the cost, the priority factor function of resource Ri that is a clock oscillator may be calculated from Rclk, NRclk, ΔA(Rclk), CRclk, wherein Rclk is a clock oscillator used to provide the necessary clock frequency to the system, ΔA(Rclk) is a change of area occupied by clock oscillators, NRclk is a total number of clock oscillators available to construct the system architecture, and CRclk is the cost per area unit of clock oscillators. As an example, for the cost, the priority factor function of resource Ri that is a clock oscillator may be of the form:
At step 110, system 10 determines a plurality of satisfying sets of vectors, and in particular, system 10 determines a satisfying set for each optimization parameter, except for the final optimization parameter. Referring now to
For this example, method 110 will first be illustrated for the example optimization parameter power consumption, in order to determine a satisfying set for the optimization parameter power consumption.
At step 130, for each kind of resource R1, . . . Rn available to construct the system architecture, system 10 calculates a priority factor using the corresponding priority factor function for the optimization parameter.
Using the example priority factor functions for power consumption described herein, the following priority factors are calculated for each resource available to construct the system architecture.
For resource adder/subtractor (R1):
For resource multiplier (R2):
For resource clock oscillator (Rclk):
The above priority factors are a measure of the change in power consumption with the change in number of a specific resource. For example, according to the above analysis, the change in clock frequency from 50 MHz to 200 MHz affects the change in power the most, while the change in number of adder/subtractor affects the change in power consumption the least. Similarly, the change in number of multipliers influences the change in power consumption more than the adder/subtractor but less than the clock.
At step 132, system 10 determines a priority order by sorting the calculated priority factors based on a relative magnitude of the calculated priority factors.
For this example, according to the priority factors calculated, the priority order (PO) is arranged so that the resource with the lowest priority factor is assigned the highest priority order while the resource with the highest priority factor is assigned the lowest priority order. The priority order of the resources increases with the decrease in priority factor of the resources. Therefore the following PO of the resources for arranging the design variants in increasing order can be attained, where “>” means precedes:
PO(R1)>PO(R2)>PO(Rclk)
Based on the above PO the variant vectors from the design space are chosen so that the design space for power consumption can be organized in increasing orders of magnitude. That is, the PO is used to arrange the vectors of the design space in increasing order. The arrangement of the variant vectors in the design space in increasing order will help to prune the design space for obtaining the satisfying set for power consumption.
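The ordering rule described above, in which the resource with the lowest priority factor receives the highest priority order, can be sketched in Python as follows. The priority factor values used here are hypothetical placeholders chosen only so that the relative ordering matches the one stated in the text; they are not the values calculated for this example, and the function name is an assumption.

```python
def priority_order(priority_factors):
    """Return resource names sorted from highest to lowest priority order.

    The resource with the smallest priority factor comes first, i.e. it is
    assigned the highest priority order.
    """
    return sorted(priority_factors, key=priority_factors.get)


# Hypothetical priority factors (illustrative numbers only).
hypothetical_pf = {"Rclk": 9.0, "R1": 0.5, "R2": 2.0}
po = priority_order(hypothetical_pf)
print(" > ".join(f"PO({name})" for name in po))  # PO(R1) > PO(R2) > PO(Rclk)
```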
Arrange the Design Space According to the Priority Order
At step 134, system 10 generates an ordered list of vectors by sorting the plurality of vectors of the design space based on the priority order.
Since the design space is large for the present generation of complex multi objective VLSI designs, finding the system architecture that best meets the specified design objectives by analyzing the design space exhaustively may be strictly prohibitive. Due to increased complexity in VLSI designs the major problem has been the examination of the design variants in the large design space for selecting a design option that is acceptable in terms of all the constraints and predefined specifications. Hence obtaining a superior quality design for the user specified specification requires a structured methodology for exploring the large design spaces. Design space exploration when performed at the higher level of abstraction pays more dividend than performing it at the lower level of abstraction like the logic or the transistor level. The job of design space exploration is apparently a battle between optimizing the following two contradictory conditions: selecting the optimum design option and efficiently searching the space in a short time. Hence, there may be a tradeoff between not only the contradictory parameters of optimization during high level synthesis design, but also between the above mentioned conditions during design space exploration in high level synthesis. To proficiently analyze the complex design spaces, an efficient means of arriving at the best result is needed. Analyzing the design to obtain the best architecture according to the requirement specified requires an efficient design space exploration technique.
Referring now to
Let NRi represent the number of a particular resource Ri, and at step 302, the initial number of all kinds of resources is set to one, NR1, . . . NRn=1.
Let position ‘p’ represent the position where a particular variant vector is located within the arranged design space, and at step 304, the position ‘p’ is set to one (p=1) and the variant vector (NR1, . . . NRn) is assigned to position ‘p’.
Let ‘i’ be an index, and at step 306, ‘i’ represents the resource whose PO is maximum.
Let NRi max (also referred to herein as maxNRi) represent the maximum number of the kind of resource Ri, and at step 308, it is determined whether NRi=NRi max.
If NRi does not equal NRi max, then at step 310, NRi is increased by one.
At step 312, variant vector (NR1, . . . NRn) is assigned to position ‘p+1’.
At step 314, position ‘p’ is increased by one, p=p+1.
Let p(final) be the final position according to the maximum number of design options available, which is the number of vectors of the design space, and at step 316, it is determined whether p=p(final).
If p does not equal p(final) then method 300 returns to step 306.
If p does equal p(final) then method 300 proceeds to step 318 and ends.
If NRi equals NRi max, then at step 320, NRi is reset to one.
At step 322, i represents the next resource with the next higher PO, and the method returns to step 308.
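Steps 302 to 322 above behave like a counter in which the resource with the highest priority order changes fastest. The following Python sketch is one possible reading of those steps, not a definitive implementation; the function name, the zero-based resource indices and the list-based representation of the priority order are assumptions introduced for illustration.

```python
def order_design_space(max_counts, priority_order):
    """Arrange the design space following steps 302-322.

    max_counts[i] is maxNRi for resource i (zero-based index, assumed >= 1).
    priority_order lists resource indices from highest to lowest PO.
    """
    total = 1
    for m in max_counts:
        total *= m                      # p(final): number of vectors

    counts = [1] * len(max_counts)      # step 302: NR1..NRn = 1
    ordered = [tuple(counts)]           # step 304: position p = 1
    while len(ordered) < total:         # step 316: stop when p = p(final)
        k = 0
        i = priority_order[k]           # step 306: resource with maximum PO
        while counts[i] == max_counts[i]:   # step 308
            counts[i] = 1               # step 320: reset NRi to one
            k += 1
            i = priority_order[k]       # step 322: next resource in the PO
        counts[i] += 1                  # step 310
        ordered.append(tuple(counts))   # steps 312-314: assign to p + 1
    return ordered


# Example constraints (3, 4, 2) with PO(R1) > PO(R2) > PO(Rclk).
ordered_list = order_design_space([3, 4, 2], [0, 1, 2])
print(ordered_list[:4])  # [(1, 1, 1), (2, 1, 1), (3, 1, 1), (1, 2, 1)]
```

Each vector of the design space appears exactly once in the result, so the ordered list has the same number of entries as the original design space.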
Referring now to
Determination of the Border Variant for the Power Consumption
Referring back to
After the vectors are arranged in increasing order to generate the ordered list of vectors, the design space is pruned to obtain the border variant for power consumption and the satisfying set of vectors for power consumption. As an example, binary search is applied to the design space shown in
P7 < Poptimal,
The obtained variants are further analyzed for power consumption according to equation (26). ‘Poptimal’ is the value of power consumed that is specified as a constraint at the beginning of the design flow. ‘Pi’ is the value of power consumption for the vector#i. When the value of Pi is less than the value of specified Poptimal, then the southern portion (down) of the design space 30 (
Referring back to
The border variant is the last variant (represented as the combination of resources shown by vectors) of the architecture vector design space for hardware area or power consumption that satisfies the constraints. In contrast, the border variant for execution time (or performance) in the architecture vector design space is the first variant that satisfies the constraint value for execution time.
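A minimal Python sketch of locating a border variant by binary search is given below. It assumes that the parameter value of each vector in the ordered list has already been evaluated and that the values are monotonic along the list (non-decreasing for power consumption or area, non-increasing for execution time). The function names and the sample numbers are hypothetical and are not taken from the tables of this example.

```python
def last_satisfying_index(values, limit):
    """Index of the last value <= limit in a non-decreasing list, or -1."""
    lo, hi, border = 0, len(values) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if values[mid] <= limit:
            border = mid        # candidate border; look further down the list
            lo = mid + 1
        else:
            hi = mid - 1        # constraint violated; look further up
    return border


def first_satisfying_index(values, limit):
    """Index of the first value <= limit in a non-increasing list, or -1."""
    lo, hi, border = 0, len(values) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if values[mid] <= limit:
            border = mid        # candidate border; look further up the list
            hi = mid - 1
        else:
            lo = mid + 1
    return border


# Hypothetical parameter values for a five-vector ordered list.
power = [2.0, 4.0, 6.0, 8.0, 10.0]        # increasing order
exec_time = [10.0, 8.0, 6.0, 4.0, 2.0]    # decreasing order
print(last_satisfying_index(power, 7.0))       # 2
print(first_satisfying_index(exec_time, 7.0))  # 2
```

A return value of -1 indicates that no vector in the list meets the constraint, which corresponds to an empty satisfying set for that parameter.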
The Border variant is determined as follows:
Algorithm
Apply the priority factor function obtained in equations (35)-(38) to determine the priority factor of each resource.
Determine the priority order based on the priority factor obtained.
Apply the priority order (PO) sequence to the algorithm proposed in
Apply a binary search mechanism (or other search) to find the border variant for the user constraint on power consumption. Calculate the power consumption of each variant visited during the search using function (27). During evaluation, find the last variant in the design space whose value is equal to or less than the power consumption constraint provided. The border variant for power consumption obtained after applying the algorithm above indicates that V21 (see
The binary search termination condition for power consumption/hardware area shown in table 2a may be implemented as follows:
Referring back to
At step 130, for each kind of resource R1, . . . Rn available to construct the system architecture, system 10 calculates a priority factor using the corresponding priority factor function for the time of execution.
Using the example priority factor functions for time of execution described herein, the following priority factors are calculated for each resource available to construct the system architecture.
For resource adder/subtractor (R1):
For resource multiplier (R2):
For resource clock oscillator (Rclk):
The factors determined above measure the change in time of execution with a corresponding change in the number of a specific resource. For instance, according to the above analysis, the change in the number of adders/subtractors affects the change in time of execution the least, while the change in clock frequency from 50 MHz to 200 MHz affects the change in time of execution the most. Similarly, the change in the number of multipliers influences the change in execution time less than the change in clock frequency.
At step 132, system 10 determines a priority order (PO) for generating the ordered list of vectors by arranging the design variants in increasing order according to the above priority factors calculated. The PO for time of execution is:
PO(R1)>PO(R2)>PO(Rclk)
Arrange the Design Space in Decreasing Order for Execution Time
At step 134, system 10 generates an ordered list of vectors using the priority order for time of execution. System 10 may generate an ordered list of vectors using method 300 of
Determination of the Border Variant for the Time of Execution Parameter
At step 136, system 10 determines a satisfying set of vectors from the ordered list of vectors. The arrangement of the design space as an ordered list of vectors in decreasing order allows the design space to be pruned to find the border variant for time of execution. As discussed herein, a binary search algorithm is well suited to searching a large ordered list such as a large design space, because it is fast and works well for large sorted lists of elements, but other search algorithms may also be used. Binary search finds the border variant in the sorted design space with a complexity of O(log N), where N here denotes the number of vectors in the ordered list. The binary search algorithm is applied to the design space as an ordered list 50 shown in
The binary search termination condition for execution time for Table 2b may be as follows:
System 10 is operable to calculate performance (execution time) by determination of Latency and Cycle time in Table 2b.
The performance (execution time) metric of a variant is a combination of latency, cycle time and number of sets of data (N) to be pipelined during processing (see equation 13). Latency (L) is the delay for the first processing output while cycle time (To) is the difference in clock cycle between the outputs of any two consecutive sets of pipelined data. The performance definition is shown with an example of a filter benchmark and is determined as follows:
Example Benchmark:
y(n)=0.167x(n)+0.5x(n−1)+0.5x(n−2)+0.167x(n−3)−0.33y(n−2)
Let, 0.167 x(n)=Ai, 0.5 x(n−1)=Bi, 0.5 x(n−2)=Ci, 0.167 x(n−3)=Di
0.167x(n)+0.5x(n−1)=Fi,
0.5x(n−2)+0.167x(n−3)=Gi,
0.167x(n)+0.5x(n−1)+0.5x(n−2)+0.167x(n−3)=Hi,
0.33y(n−2)=Ji
0.167x(n)+0.5x(n−1)+0.5x(n−2)+0.167x(n−3)−0.33y(n−2)=Ki
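The decomposition of the benchmark into the intermediate values Ai to Ki can be checked with a short Python sketch. The function names are assumptions, and samples before the start of the input (and the corresponding earlier outputs) are assumed to be zero; the sketch only evaluates the difference equation and does not model latency, cycle time or resource sharing.

```python
def filter_step(x0, x1, x2, x3, y2):
    """Evaluate one output sample using the intermediate values Ai..Ki.

    x0..x3 are x(n)..x(n-3) and y2 is y(n-2).
    """
    a = 0.167 * x0      # Ai
    b = 0.5 * x1        # Bi
    c = 0.5 * x2        # Ci
    d = 0.167 * x3      # Di
    f = a + b           # Fi
    g = c + d           # Gi
    h = f + g           # Hi
    j = 0.33 * y2       # Ji
    return h - j        # Ki = y(n)


def run_filter(samples):
    """Apply the benchmark to a list of samples, assuming zero initial state."""
    outputs = []
    for n in range(len(samples)):
        past = [samples[n - k] if n - k >= 0 else 0.0 for k in range(4)]
        y2 = outputs[n - 2] if n >= 2 else 0.0
        outputs.append(filter_step(past[0], past[1], past[2], past[3], y2))
    return outputs


# Response to a unit impulse (illustrative input only).
impulse_response = run_filter([1.0, 0.0, 0.0, 0.0, 0.0])
print(impulse_response)
```

Each output sample requires five multiplications (Ai, Bi, Ci, Di, Ji) and four additions or subtractions (Fi, Gi, Hi, Ki), which is the workload that the adder/subtractor and multiplier resources must execute.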
Referring now to
Referring now to
Determination of the Pareto-Optimal Set of Design Architecture
Referring back to
The set of vectors may be referred to as the pareto-optimal set in some example embodiments. The set of vectors contains all those architectural variants that satisfy (or nearly satisfy) the constraints. Hence the process of analyzing the initial large design space is reduced to analyzing only the architectural variants in the pareto-optimal set. For the illustrative example just three vectors from each satisfying set of optimization parameters, power consumption and time of execution, simultaneously satisfy both power consumed and execution time. The vectors are V5, V13 and V21 (see
In accordance with some embodiments, system 10 is operable to determine whether the constraint values are valid using the set of vectors. System 10 performs this constraint validation check by determining whether the set of vectors is empty. An empty set of vectors signifies that the constraint values provided are too tight. If so, the strict constraint values of the given optimization parameter need to be relaxed to a certain extent. The algorithm used by system 10 to detect and resolve the problem is described below:
At step 114, system 10 selects a vector from the set of vectors using the final optimization parameter. The selected vector is for use in constructing the system architecture.
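The intersection of the satisfying sets, the check for an empty (vacant) set of vectors, and the selection using the final optimization parameter can be sketched in Python as follows. The satisfying sets, the per-unit areas and the area model (which ignores the clock oscillator) are hypothetical values assumed purely for illustration, as are the function names; they are not the data or equations of this example.

```python
def intersect_satisfying_sets(satisfying_sets):
    """Return the vectors common to every satisfying set, in sorted order."""
    sets = [set(s) for s in satisfying_sets]
    return sorted(set.intersection(*sets)) if sets else []


def select_vector(candidates, final_parameter):
    """Return the candidate minimizing the final parameter, or None if empty.

    None signals an empty set of vectors, i.e. constraints that are too tight.
    """
    if not candidates:
        return None
    return min(candidates, key=final_parameter)


def area(vector):
    # Hypothetical area model: assumed 20 area units per adder/subtractor
    # and 100 area units per multiplier; the clock oscillator is ignored.
    return 20 * vector[0] + 100 * vector[1]


# Hypothetical satisfying sets for power consumption and time of execution.
power_set = {(1, 1, 1), (1, 1, 2), (2, 1, 2), (3, 1, 2)}
time_set = {(1, 1, 2), (2, 1, 2), (3, 1, 2), (3, 4, 2)}

candidates = intersect_satisfying_sets([power_set, time_set])
selected = select_vector(candidates, area)
print(candidates, selected)
```

Returning None for an empty intersection keeps the validation step separate from the selection step, so a caller can relax a constraint value and repeat the search.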
Referring now to
At step 140, for each kind of resource R1, . . . Rn available to construct the system architecture, system 10 calculates a priority factor using the corresponding priority factor function for the final optimization parameter. For this example, the priority factor for each resource (R1, R2, Rclk) is determined using equations (9)-(12) to arrange the vectors of the set of vectors (from the intersection of the satisfying sets) in increasing order, similarly to the way it was determined for power and execution time.
At step 142, system 10 determines a priority order by sorting the calculated priority factors based on a relative magnitude of calculated priority factors. After calculation of the priority factor for each resource, the priority order is determined. The obtained priority order is: PO(Rclk)>PO(R1)>PO(R2).
At step 144, system 10 generates an ordered list of vectors by sorting the set of vectors based on the priority order. For this example, system 10 arranges the vectors V5, V13, V21 of the set of vectors in increasing orders of magnitude.
System 10 is operable to determine and demonstrate the final variant (which satisfies the constraint values of all three optimization parameters) from the intersection set of vectors. System 10 is operable to determine the final variant vector according to the following algorithm.
Algorithm
Referring back to
Additionally or alternatively, referring back to
Referring now to
The Scheduling of Operations Through the Sequencing and Binding Graph for the Selected Vector
At step 202, system 10 generates a sequencing graph and a binding graph based on the selected vector. System 10 uses the aid of a sequencing graph and a binding graph to represent the combination of a number of each kind of resource specified by the selected vector in the temporal and spatial domain. The flow of data elements through different operators in the data path can be visualized with the help of sequencing graphs. This graphical representation of the application can distinctly underline the operations in discrete time steps while maintaining the precedence constraints specified. Referring now to
Scheduling is a process that assigns a time slot to every operation while fixing the timing length (latency) so that the synthesized hardware structure meets the specified timing restriction. A classical example of time-constrained scheduling is shown, in which the scheduler must realize the behavior with the minimum possible number of functional units. The scheduling of operations is performed based on the As Soon As Possible (ASAP) algorithm. Many algorithms may be used for scheduling operations, such as As Late As Possible (ALAP), list scheduling, force-directed scheduling, ASAP, and so on; ASAP was selected because the operations should be done as soon as the resources R1 76 and R2 78 become free. As soon as the processed data from the previous stage is ready, it is used for the next operation. The binding graph will be used in further design stages to realize the function used as a benchmark application for demonstration of the optimized high level synthesis design flow.
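The ASAP rule described above can be sketched as follows. This is a minimal illustration only, assuming that every operation takes exactly one time step and that ties are broken by listing order; the function name, the operation names and the small dependency graph are hypothetical and are not the sequencing graph of the benchmark application.

```python
from collections import defaultdict

def asap_schedule(ops, deps, available):
    """Give each operation the earliest time step at which all of its
    predecessors have finished and a unit of its resource kind is free.

    ops:       dict of operation name -> resource kind (e.g. "R1", "R2")
    deps:      dict of operation name -> list of predecessor names
    available: dict of resource kind -> number of units (the selected vector)
    Assumption: every operation occupies its resource for one time step.
    """
    schedule = {}
    step = 0
    while len(schedule) < len(ops):
        step += 1
        if step > len(ops):
            raise ValueError("no progress: cyclic dependencies or missing resource")
        used = defaultdict(int)
        for op, kind in ops.items():  # listing order breaks ties (assumption)
            if op in schedule:
                continue
            # Ready only if every predecessor finished in an earlier step.
            ready = all(schedule.get(p, step) < step for p in deps.get(op, []))
            if ready and used[kind] < available.get(kind, 0):
                schedule[op] = step
                used[kind] += 1
    return schedule

# Hypothetical graph: two multiplications feeding one addition.
ops = {"m1": "R2", "m2": "R2", "a1": "R1"}
deps = {"a1": ["m1", "m2"]}
print(asap_schedule(ops, deps, {"R1": 1, "R2": 1}))
print(asap_schedule(ops, deps, {"R1": 1, "R2": 2}))
```

With a single multiplier the two multiplications are serialized and the addition waits for both; with two multipliers the latency shrinks by one step, which is the kind of trade-off the selected vector fixes.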
Sequencing and Binding Graph with Data Registers
Referring back to
Determination of the Multiplexing Scheme
Referring back to
Development of the Multiplexing Scheme
System 10 is operable to develop the multiplexing scheme table (MST) from the scheduling step (Sequencing Graph with data registers) by implementing the following algorithm.
System 10 is operable to implement the algorithm mentioned above to create table 3 (multiplexing table for adder/subtractor):
Similarly, Table 4, the multiplexing scheme table for multiplier can be obtained.
At step 208, system 10 generates a system block diagram. After the multiplexing scheme has been successfully performed, the next phase of the design flow is the development of the system block diagram. The system block diagram comprises two divisions, data path circuit and the control unit. The data path is responsible for the flow of data through the buses and wires after the operations have been performed by the components present in the data path circuit. Thus, the data path provides the sequence of operations to be performed on the arriving data based on the intended functionality. As an example, the data path can comprise registers for storage of data, memory elements such as latches for sinking of data in the next stage, as well as multiplexers and demultiplexers for preparation of data at run time by change of configuration. The data path circuit also consists of functional resources which are accountable for performing the operations on the incoming data. The block diagram for the benchmark application consists of two resources (an adder/subtractor and a multiplier) for executing their respective assigned operations. Another component of the system block diagram is the control unit or the controller. A centralized control unit controls the entire data path circuit and provides the necessary timing and synchronization required by data traversing through the data path circuit. The control unit acts as a finite state machine that changes its state according to the requirement of activating and deactivating the various elements of the data path at different instances of time. Based on the multiplexing scheme the block diagram of the data path circuit was constructed to demonstrate design flow for the benchmark application.
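The role of the control unit as a finite state machine can be illustrated with a short sketch. The signal names and the five-state cycle below are hypothetical placeholders chosen for illustration; they are not taken from the controller table of the described embodiment.

```python
# Hypothetical control signals, one set per controller state.
CONTROL_TABLE = [
    {"mux_select"},     # state 0: choose the multiplexer inputs
    {"input_strobe"},   # state 1: latch the selected operands
    {"enable"},         # state 2: activate the functional unit
    {"output_strobe"},  # state 3: latch the result
    {"demux_select"},   # state 4: route the result onward
]

def run_controller(control_table, cycles):
    """Advance one state per clock cycle, wrapping around at the end of
    the table, and record which signals are asserted in each cycle."""
    state = 0
    trace = []
    for _ in range(cycles):
        trace.append((state, sorted(control_table[state])))
        state = (state + 1) % len(control_table)
    return trace

for state, signals in run_controller(CONTROL_TABLE, 6):
    print(state, signals)
```

Each state asserts only the signals needed at that instant, so elements of the data path are activated and deactivated in a fixed, repeating order.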
Referring now to
At step 210, system 10 generates a RTL level representation of the system architecture. System 10 is operable to create the RTL data path circuit diagram from Multiplexing Scheme as follows.
The Block diagram of the RTL data path circuit in
Algorithm
Let the number of variables available for INPUT 1 of the multiplexing scheme table for resource R1 be denoted as Vx. Therefore, from the multiplexing scheme table 3 for adder/subtractor, Vx=4, since there are 4 possible input variables (R2out, R2out, R2out and R2out) for INPUT 1 of adder/subtractor.
Let the number of possible variables available for INPUT 2 of the multiplexing scheme table for resource R1 be denoted as Vy. Therefore, from the multiplexing scheme table 3 for adder/subtractor, Vy=4, since there are 4 possible input variables (RegP, R1out, R1out and R1out) for INPUT 2 of adder/subtractor.
Based on the value of Vx=4, a 4-bit multiplexer (MUX 1) component is adopted from the module library. The inputs to the 4-bit MUX would be the 4 possible variables acting as inputs for INPUT 1 which are in this case R2out, R2out, R2out and R2out as mentioned in step 1.
Similarly, based on the value of Vy=4, a second 4-bit multiplexer (MUX 2) component is again adopted from the module library. The inputs to this second 4-bit MUX would be the 4 possible variables acting as inputs for INPUT 2 which are in this case RegP, R1 out, R1 out and R1 out as mentioned in step 1. Selector signals are assigned to each multiplexer which selects different inputs based on the information of the select lines.
If the multiplexers obtained in step 2 are N bit multiplexers, then input storage elements are needed for each case to store the data from different inputs at different time instances.
For the application discussed as an example, the multiplexers (MUX 1 and MUX 2) obtained in step 2 are 4 bit multiplexers, so the same mux unit is shared at different time instances. This mandates the incorporation of storage latches to temporarily hold the data for various inputs until needed by the next component. Therefore, for each multiplexer formed in step 2, a corresponding latch is added. A strobe signal is assigned to each input latch, which latches the data when needed. Thus this strobe maintains the synchronization process.
Else if the multiplexers obtained in step 2 are 1 bit multiplexers, then no input storage element is needed in the design.
Following the latch component, the main functional unit has to be added from the library. This unit actually processes the data based on the inputs received at different time instances by the two multiplexers through the corresponding storage latches. In this case, the outputs of the two latches act as inputs of the adder/subtractor resource. Hence, the same adder/subtractor resource performs the same functional operation but on different inputs at different time instances (as received from the latches). An enable signal is assigned to the functional unit (resource), which activates the resource when both the inputs are ready. Thus the enable signal also maintains the synchronization.
Since the same adder/subtractor resource performs the same functional operation on different inputs at different time instances, an output storage latch needs to be incorporated in the data path unit of the RTL circuit. This output storage latch holds the different outputs produced by the functional unit. An output strobe signal is assigned to the output latch, which latches the data when needed. Thus this strobe is also responsible for the synchronization process.
If an output storage latch is present in the data path unit, then a demultiplexer has to be added in the data path unit. This is because the output latch stores different results from the functional unit at different times, so a structure is needed that can deliver all of these results from the latch through parallel wires. Hence an N-bit demultiplexer is needed. In the example discussed so far, the N bit width of the demultiplexer equals the N bit width of any input multiplexer. A de-selector signal is assigned to the demultiplexer, which outputs the different results of the latch through different wires.
Else if an output storage latch is not present in the data path unit, then no demultiplexer needs to be added in the data path unit.
This process is repeated for all the multiplexing tables developed in the design process. In this case, steps 1-6 were repeated for the multiplexing table for the multiplier resource (R2).
Once all the connections mentioned in steps 1-7 for the discrete components of a specific resource are complete, the outputs of each resource stage are connected to the inputs of the other resource stage based on the information present in the multiplexing scheme tables. For example, in this case, input ‘R1 out’ of resource R1 results from the output ‘R1in’ of the same resource. Also, output ‘R1in’ of the second resource R2 acts as the input of the first resource R1 in the form of ‘R2out’ through the second MUX. Similarly, using the information from the multiplexing tables, the interconnected components of each resource are connected to each other to obtain a circuit in
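The component-selection steps of the algorithm above can be summarized in a short sketch. It is illustrative only: the function name and the component labels are hypothetical, and it assumes that the output latch (and hence the demultiplexer) is added exactly when the resource is shared across time instances (N greater than 1), which the steps imply but do not state as an explicit rule.

```python
def datapath_components(resource, input1_vars, input2_vars):
    """List, in data-flow order, the components placed around one resource.

    input1_vars / input2_vars: entries of the INPUT 1 / INPUT 2 columns of
    the multiplexing scheme table, one entry per time step.
    """
    vx, vy = len(input1_vars), len(input2_vars)
    parts = [f"MUX 1 (N={vx})", f"MUX 2 (N={vy})"]
    shared = vx > 1 or vy > 1  # resource reused at several time instances
    if shared:
        # Input latches hold the operands until the functional unit is enabled.
        parts += ["input latch 1", "input latch 2"]
    parts.append(resource)
    if shared:
        # Demultiplexer width follows the input multiplexer width.
        parts += ["output latch", f"DEMUX (N={vx})"]
    return parts

# INPUT columns for the adder/subtractor as listed in the text.
print(datapath_components(
    "adder/subtractor",
    ["R2out", "R2out", "R2out", "R2out"],
    ["RegP", "R1out", "R1out", "R1out"],
))
```

For the adder/subtractor this yields two multiplexers, two input latches, the functional unit, an output latch and a demultiplexer; repeating the call with the multiplier's table gives the second stage.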
Development of the Centralized Control Unit with Timing Specification
Referring back to
System 10 is operable to implement the following procedure for determination of the controller table.
The procedure for determination of the controller table is to identify the various control signals that control the different components of the data path. With the block diagram developed in the previous step (see the algorithm for development of the block diagram of the data path from the multiplexing scheme table), there can be ‘m’ control signals, depending on the complexity of the data path unit architecture. For the example used, after development of the block diagram of the data path unit as shown in
Schematic Structure Development for the Whole System
At step 214, system 10 simulates the schematic structure for testing and verification and then implements it. The schematic structure of the device may be developed in any of the available synthesis tools. Examples include Synopsys, Xilinx Integrated Software Environment (ISE) and Altera Quartus II. For this example, components in the data path may be described and implemented in VHDL before verification. Then, as an example, the schematic structure of the whole device may be designed and implemented in Xilinx Integrated Software Environment (ISE) version 9.2i. Referring now to
Analysis and Results
For determination of a system architecture based on the selected vector, design space exploration may require elaborate analysis and evaluation of the architectural variants (design points). Before selecting the vector specifying the combination of resources to use for the system architecture, the border variants of architecture for the optimization parameters (execution time and area/power, for example) are found separately. In an example, binary search conducted on the arranged design space (increasing or decreasing) leads to the border variant, taking into account the operating constraints for execution time and area/power separately. Other search algorithms can also be used. The proposed DSE approach uses binary search after the arrangement of the design space using the priority factor method. The search of the optimal architecture requires only
Where ‘n’=number of type of resources and ‘vRi’ is the number of variants of resource ‘Ri’. On the contrary, the exhaustive search checks for
architectural variants during the search for the optimal architecture while satisfying all operating constraints. In this design space exploration approach and in the design flow, three optimization parameters have been used for optimization, but additional optimization parameters may also be used, such as cost for example. In one example, the execution time and power are the parametric constraints and area is the final optimization parameter. Hence, the searching has to be repeated for both optimization parameters to determine the border variant.
Therefore the total number of architecture evaluations using exhaustive search is given as
And the total number of architecture evaluations using the proposed method is given as
Here, ‘M’ denotes the number of performance parameters. In this case the value of ‘M’ is two because there are two performance parametric constraints. The proposed approach was applied to various realistic benchmarks to check the acceleration obtained through this DSE method. Results indicated a substantial speedup compared to the exhaustive approach. The results of the proposed design space exploration framework for the realistic benchmarks are illustrated in Table 6.
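The difference between the two approaches can be made concrete with a small worked count. This is a sketch under stated assumptions, not a restatement of the original expressions: it assumes that the design space size is the product of the per-resource variant counts, that binary search over an ordered list of that size needs about its base-2 logarithm of evaluations, and that the search is repeated once per parametric constraint.

```latex
N = \prod_{i=1}^{n} v_{R_i}, \qquad
E_{\mathrm{exhaustive}} \approx M \cdot N, \qquad
E_{\mathrm{binary}} \approx M \cdot \left\lceil \log_2 N \right\rceil

% Illustrative values (assumed): n = 3 resource types with 3 variants each, M = 2
N = 3 \cdot 3 \cdot 3 = 27, \qquad
E_{\mathrm{exhaustive}} \approx 2 \cdot 27 = 54, \qquad
E_{\mathrm{binary}} \approx 2 \cdot \lceil \log_2 27 \rceil = 2 \cdot 5 = 10
```

Under these assumptions the number of evaluations grows with the logarithm of the design space size rather than with the size itself, which is where the reported speedup would come from.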
Referring now to
System 10 is operable to verify the device or integrated circuit, designed through the high level synthesis design flow, for its accurate functionality. System 10 is further operable to import the design in a Synopsys tool for flattening of the circuit. After flattening, system 10 is operable to execute the steps needed for floorplanning, power planning, placement and routing.
System 10 is capable of resolving the conflicting objectives in DSE by concurrently maximizing the accuracy in evaluation of the design point and minimizing the time expended for design space assessment. This approach is applicable to all system architectures based on modules with known performance requirements and system specifications. Formalizing the design methodology for multi parametric HLS may be useful for many industrial projects and modern automated high level synthesis tools.
As another example, the following illustrates an application of the embodiments with hardware area and execution time as maximum constraint values and power as a minimum constraint value.
As shown in
At the problem formulation stage of the high level synthesis, the mathematical model of the application is used to define the behavior of the algorithm. The model suggests the input/output relation of the system and the data dependency present in the function. As an example, a second order IIR digital filter function is used to demonstrate the high level synthesis design flow. The transfer function of a second order IIR digital filter can be given as:
For this example, x(n), x(n−1) and x(n−2) are the input vector variables for the function. The previous outputs are given by y(n−1) and y(n−2), while the present output of the function is given by y(n). For simplicity, constants 0.041, 0.082, 0.6743 and 1.4418 have been denoted as ‘A’, ‘B’, ‘D’ and ‘E’ respectively, while x(n), x(n−1), x(n−2), y(n−1) and y(n−2) are denoted by Xn, Xn1, Xn2, Yn1 and Yn2 respectively.
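The behavior being synthesized can be sketched in software as follows. This is an illustration only, assuming the filter has the same form as equation (1d) given later in this description, with the coefficient of x(n−2) equal to ‘A’; the function names and the zero initial state are assumptions.

```python
def iir_step(xn, xn1, xn2, yn1, yn2, A=0.041, B=0.082, D=0.6743, E=1.4418):
    """One output sample: five multiplications and four additions/subtractions.
    y(n) = A*x(n) + B*x(n-1) + A*x(n-2) - D*y(n-2) + E*y(n-1)
    """
    return A * xn + B * xn1 + A * xn2 - D * yn2 + E * yn1

def iir_filter(samples):
    """Run the filter over a sequence, starting from a zero state (assumed)."""
    xn1 = xn2 = yn1 = yn2 = 0.0
    out = []
    for xn in samples:
        yn = iir_step(xn, xn1, xn2, yn1, yn2)
        out.append(yn)
        xn2, xn1 = xn1, xn  # shift the input history
        yn2, yn1 = yn1, yn  # shift the output history
    return out

print(iir_filter([1.0, 0.0, 0.0]))  # first samples of the impulse response
```

The multiplications map onto the multiplier resource and the additions and subtractions onto the adder/subtractor resource, which is why these two resource kinds span the design space.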
Referring to
The design space shown in Table 8 below illustrates the different combinations of the resources available during system design, viz. adder/subtractor and multiplier.
At step 108, for each optimization parameter, system 10 defines a priority factor function for each kind of resource available to construct the system architecture. Example priority factor functions are described herein.
At step 110, for each optimization parameter except the final optimization parameter, system 10 is operable to implement method 110 illustrated in
At step 130, for hardware area, system 10 is operable to calculate the priority factor (PF) for each available resource and to arrange the PFs in increasing order for area.
For resource adder/subtractor (R1):
For resource multiplier (R2):
For resource clock oscillator (Rclk):
The above factors are a true measure of the change in area with the change in number of a specific resource. For example, according to the above analysis, the change in the number of multipliers affects the change in area the most, while the change in clock frequency from 24 MHz to 400 MHz influences the change in area the least.
At step 132, system 10 determines the priority order (PO) according to the priority factors calculated. For this example, the PO is arranged so that the resource with the lowest priority factor is assigned the highest priority order while the resource with the highest priority factor is assigned the lowest priority order. The priority order of the resources increases with the decrease in priority factor of the resources. Therefore the following PO of the resources is obtained for arranging the design variants in increasing order.
PO(Rclk)>PO(R1)>PO(R2)
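The rule for deriving the priority order from the priority factors can be sketched as follows. The numeric priority factors below are hypothetical placeholders chosen only so that the result reproduces the ordering stated above; they are not the values computed in this example.

```python
def priority_order(priority_factors):
    """Return resource names from highest to lowest priority order.
    The resource with the lowest priority factor gets the highest priority."""
    return sorted(priority_factors, key=priority_factors.get)

# Hypothetical priority factors for area (placeholders, not computed values).
pf_area = {"R1": 0.3, "R2": 0.6, "Rclk": 0.1}
print(" > ".join(f"PO({r})" for r in priority_order(pf_area)))
```

Because only the relative ranking of the factors matters, any values with the same ranking yield the same priority order.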
At step 134, system 10 generates an ordered list of vectors based on the above priority order such that the design space for area is organized in increasing order of magnitude. This helps prune the design space when obtaining the border variant for area. An ordered set of vectors 71 is shown in
Determination of the Border Variant for the Area
At step 136, system 10 determines a satisfying set (or nearly satisfying set) by applying a binary search to the ordered set of vectors 71 shown in
As shown in
For the next optimization parameter, time of execution, system 10 returns to step 130 to calculate the priority factor for each available resource for time of execution.
For resource adder/subtractor (R1):
For resource multiplier (R2):
For resource clock oscillator (Rclk):
The factors determined above indicate a measurement of the change in time of execution with the change in number of a specific resource. For instance, according to the above analysis the change in number of adder/subtractor affects the change in time of execution the least, while the change in clock frequency from 24 MHz to 400 MHz affects the change in time of execution the most. Similarly, the change in multiplier influences the change in execution time less than the change in clock frequency.
At step 132, system 10 determines the following priority order (PO) for arranging the design variants in increasing order according to the priority factors calculated above.
PO(R1)>PO(R2)>PO(Rclk)
At step 134, system 10 generates an ordered list of vectors based on the PO. Referring now to
At step 136, system 10 determines the satisfying set by determining the border variant for the time of execution parameter. The binary search algorithm is applied to the design space (or ordered list of vectors). The variants that are analyzed in the process of the binary search, according to equation (13), to determine the best variant are shown in Table 10. Analysis reveals that variant number ‘V5’ is the border variant for the ‘time of execution’ parameter. Hence all the design variants to the south of the vector space must satisfy the constraint imposed by the user.
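The border variant search can be sketched as a binary search over the ordered list. This is a minimal illustration, assuming the list is ordered so that the constraint switches from unsatisfied to satisfied exactly once; the function name, the predicate and the execution time values are hypothetical.

```python
def find_border_variant(ordered_variants, satisfies):
    """Return (index of the first satisfying variant or None, evaluations).
    Assumes all non-satisfying variants precede all satisfying ones."""
    lo, hi = 0, len(ordered_variants)
    evaluations = 0
    while lo < hi:
        mid = (lo + hi) // 2
        evaluations += 1  # one architecture evaluation per probe
        if satisfies(ordered_variants[mid]):
            hi = mid      # border is at mid or earlier
        else:
            lo = mid + 1  # border is after mid
    border = lo if lo < len(ordered_variants) else None
    return border, evaluations

# Hypothetical execution times for 27 variants, in decreasing order.
times = list(range(270, 0, -10))
border, evaluations = find_border_variant(times, lambda t: t <= 100)
print(border, times[border], evaluations)
```

Every variant from the border onward satisfies the constraint, so the satisfying set is obtained without evaluating the remaining variants individually.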
At step 112, system 10 determines a set of vectors based on the intersection of the satisfying sets of vectors for area and time of execution.
At step 114, system 10 selects a vector from the set of vectors, as illustrated by the flowchart in
At step 140, system 10 calculates the priority factor for the power consumption parameter according to equations (25)-(28) in order to arrange the variants of the set of vectors in increasing order, similar to the way the priority factors for area and execution time were determined. At step 142, after calculation of the PF, the priority order is determined. The obtained priority order is: PO (R1)>PO (R2)>PO (Rclk).
Variants V5, V7, V16, V25 and V8 belong to the set of vectors. The variants V5, V7, V16, V25 and V8 are arranged in increasing order for power consumption. According to the specification provided, the variant with the minimum power consumption should be selected. At step 144, system 10 generates an ordered list of vectors based on the PO and at step 146 selects a vector. In this example, variant ‘V5’ is selected as it represents a combination of resources with the minimum power consumption. Therefore variant number ‘V5’ is the only variant from the whole design space consisting of 27 variants that concurrently optimizes hardware area, power consumption and time of execution while meeting all the specifications provided.
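The final selection can be sketched as an intersection followed by a scan of the list ordered for the final parameter. The satisfying sets and the power ordering below are hypothetical illustrations (only the membership of V5, V7, V16, V25 and V8 in the intersection is taken from the text), and the function name is an assumption.

```python
def select_variant(satisfying_sets, ordered_by_final_parameter):
    """Pick the first variant in the final-parameter ordering that belongs
    to every satisfying set; return None if the intersection is empty."""
    common = set.intersection(*(set(s) for s in satisfying_sets))
    for variant in ordered_by_final_parameter:
        if variant in common:
            return variant
    return None

# Hypothetical satisfying sets for area and time of execution.
area_ok = ["V1", "V5", "V7", "V16", "V25", "V8"]
time_ok = ["V5", "V7", "V16", "V25", "V8", "V19"]
# Hypothetical ordering of variants by increasing power consumption.
by_power = ["V1", "V5", "V7", "V16", "V25", "V8", "V19"]
print(select_variant([area_ok, time_ok], by_power))
```

Because the list is already sorted for the final parameter, the first common variant is the minimum, and no further comparison among the remaining candidates is needed.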
As an additional example, system 10 may receive cost of resources and execution time as maximum constraints, and power as a minimum constraint. The function of the second order digital IIR Chebyshev filter that system 10 is operable to implement is given in (1d).
y(n)=0.041x(n)+0.082x(n−1)+0.041x(n−2)−0.6743y(n−2)+1.4418y(n−1) (1d)
Where x(n), x(n−1) and x(n−2) are the input vector variables for the function. The previous outputs are given by y (n−1) and y(n−2), while the present output of the function is y(n).
System 10 calculates priority factors for the cost of each resource and determines a priority order (consisting of resources) in increasing order for cost.
The PF of the different resources for cost model is given as:
Based on the PF calculated for the cost model, the architecture vector space for cost can be constructed. This architecture tree uses a new topology for design space arrangement. The vector space topology enables quick arrangement of the design space, which is then used for searching for the border variant. The topology has the further advantage that it does not require any special algorithm to arrange the design space: merely constructing the architecture vector space based on the calculated PF for a parameter ensures that the design space is sorted in increasing or decreasing order of magnitude. The architecture vector space comprising the design space thus becomes automatically arranged in increasing order of magnitude for the cost model.
Therefore applying binary search on the sorted design space for cost yields the border variant in just a few comparisons. The border variant for cost is the last variant in the design space which satisfies the specified constraint for cost. The border variant obtained for cost is ‘V19’.
System 10 determines an ordered set by arrangement of the design space in decreasing order in the form of an architecture vector space for execution time.
The PF of the different resources used in system design for execution time model is given below:
The job of a memory element (RM) is just to hold the data temporarily until it is used by some functional unit in the next clock cycles. Therefore the change in the number of memory elements does not directly affect the execution time (e.g. the execution time does not change regardless of whether there are two separate registers at different time slots or both are bound together to act as one single register to save chip area). Hence, TRMMax=TRMMin, and therefore PF(RM)=0.
Based on the PF calculated for execution time, the architecture vector space for execution time is also constructed. Hence, the architecture vector space obtained after construction is automatically arranged (sorted) in decreasing order of magnitude. After arrangement, binary search is applied in order to find the border variant for execution time. The border variant for execution time is the first variant in the design space which satisfies the specified constraint for execution time. The border variant obtained is variant ‘V19’. After the border variants for both cost and execution time are obtained, the Pareto optimal set is obtained. Then the architecture vector space for power consumption is similarly constructed, as explained before, using the PF function, in increasing order of magnitude for power consumption. Among the variants of the Pareto set, the variant which appears first in the design space sorted in ascending order is the one with the minimum power consumption; it concurrently satisfies the constraints for cost, execution time and power consumption (specified in Table 11) for the design problem.
The present invention has been described here by way of example only. Various modifications and variations may be made to these exemplary embodiments without departing from the spirit and scope of the invention, which is limited only by the appended claims.
Number | Date | Country | |
---|---|---|---|
20120159119 A1 | Jun 2012 | US |