The present invention relates generally to data processing and, more particularly, to circuits sharing hardware resources for optimizing data processing performance and/or managing power consumption.
As is known in the art, there are a variety of systems and architectures for processing data. It is expected that next generation communication-enabled systems will appear in homes, offices, cars, military equipment, and the like. While performance, area, and power constraints have to date been the primary focus in designing many current systems, such systems generally have a limited ability to dynamically adapt to changing processing requirements. Disadvantages of such non-configurable systems include a lack of reusability and limited product longevity.
While some level of configurability can be achieved by programming a general purpose embedded processor, or by coupling a Field Programmable Gate Array (FPGA) with a processor, sometimes performance requirements prohibit such combinations. Thus, to meet certain performance requirements, it may be required to design Application-Specific Configurable (ASC) hardware. A number of application-specific configurable architectures have been proposed having varying granularity (fine vs. coarse), routing resources, configuration abilities, and underlying computational models, e.g., SIMD (Single Instruction-Stream Multiple Data-Stream) vs. MIMD (Multiple Instruction-Stream Multiple Data-Stream).
One known architecture utilizes Simultaneous Multi-Threading (SMT), which allows the interleaving of instructions from more than one software thread in a single time slice, thus eliminating processor underutilization when a thread is stalled due to cache misses or data and/or control dependencies. SMT spreads software instructions to multiple functional units but does not address hardware underutilization.
It would, therefore, be desirable to overcome the aforesaid and other disadvantages.
The present invention provides a circuit having dynamic adaptability by using hardware threading. Circuit processing elements are interconnected to enable dynamic borrowing of hardware processing resources of a first processing element by a second processing element. With this arrangement, parallel processing of application pipeline stages is achieved to enhance overall processing performance. In addition, performance and power reduction can be emphasized to meet the needs of a particular application. While the invention is primarily shown and described in conjunction with multimedia and communication applications, it is understood that the invention is applicable to circuits in general in which increased throughput and/or power reduction is desirable.
In one aspect of the invention, a method of designing a hardware threaded circuit architecture includes determining a total area available for processing elements and determining a set of task arrival times for tasks to be processed by the processing elements. The method further includes determining possible implementations for the processing elements within the area available with each of the possible implementations having a corresponding number of processing elements. In addition, the method can include interconnecting at least two of the processing elements to enable hardware threading, determining overall system wait times for the possible implementations, and selecting an implementation based upon the overall system wait times.
In another aspect of the invention, a hardware threaded circuit includes a memory, a task manager coupled to the memory, and a plurality of processing elements coupled to the task manager. First and second processing elements are interconnected in a hardware threaded configuration to enable dynamic borrowing of processing resources associated with the second processing element by the first processing element. With this arrangement, pipeline stages can be processed in parallel by utilizing resources of the first and second processing elements simultaneously.
The invention will be more fully understood from the following detailed description taken in conjunction with the accompanying drawings.
In general, the present invention provides a novel hardware optimization scheme, referred to as hardware threading (HT), that is useful in building Application-Specific Configurable (ASC) hardware. The inventive hardware threading mechanism dynamically borrows unutilized or underutilized resources to boost processing performance and/or lower power consumption. With this arrangement, in exemplary application domains, such as multimedia and communications, an incoming packet (task) is processed independently of other packets so as to enhance the overall circuit performance.
While the inventive hardware threading technique described herein is primarily shown and described in conjunction with multimedia and communication application domains, it is understood that the invention is applicable to circuit architectures in general in which it is desirable to optimize utilization of hardware resources. It will be appreciated that the hardware threading mechanism of the present invention is well suited for applications that require on-demand high-performance processing capabilities.
In an exemplary embodiment, each incoming packet can be processed independently of other packets. Parallel hardware processing therefore improves performance. In addition, the workloads are dynamic and can differ significantly depending on the external environment. These characteristics allow the use of stochastic processes and queuing theory, for example, to analyze system performance and to guide the synthesis of optimal hardware-threaded architectures.
As discussed above, certain applications, such as multimedia and network processing application domains, may benefit from an ability to process tasks independently. As used herein, task refers to an independent packet or thread that arrives and should be processed.
As described more fully below, the hardware threading mechanism provides temporary borrowing of unutilized pipeline stages from other processing elements to boost throughput performance and/or to reduce the power consumption of tasks running on active processing elements. Borrowing resources involves adding interconnect and steering data among shared processing elements. With borrowed resources, a task can complete significantly faster, shortening the service time of each task.
The task manager 204 processes the queued tasks and sends them to a respective one of a series of processing elements 206a-n. It is understood that the system can include any number of processing elements 206. The processing elements can be provided as circuitry capable of providing various computational functions, such as addition, subtraction, multiplication, division, and other logical and mathematical operations typical in functional units or specialized for specific applications.
The second processing element 206b includes an input module 218 providing tasks to a first type A resource 220a and a multiplexer 222. The signals provided to the second type A resource 210b of the first processing element 206a are also provided to the multiplexer 222. A second type A resource 220b receives signals from the multiplexer 222 and provides an output to a type B resource 224 and to the first processing element multiplexer 216. The type B resource 224 provides signals to an output module 226 of the second processing element.
In one particular embodiment, the type A resources 210a,b, 220a,b correspond to addition resources and the type B resources 212, 224 correspond to multiplication resources. It is understood, however, that the processing elements can include a wide variety of resource types and numbers of resources without departing from the present invention.
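The borrowing path lends itself to a compact behavioral model. The following C++ sketch is purely illustrative (the names and structure are assumptions, not the patent's reference designators): a multiplexer ahead of the second adder selects either the local datapath or a signal borrowed from the neighboring processing element.

```cpp
// Behavioral sketch of the mux-based borrowing path described above.
// Each processing element has two adders (type A resources) and one
// multiplier (type B resource); the mux ahead of the second adder selects
// either the local datapath or a signal lent by the neighboring element.
struct PEInputs { int x, y, z; };

int process(const PEInputs& in, bool threaded, int borrowed_signal) {
    int a1  = in.x + in.y;                       // first type A resource
    int mux = threaded ? borrowed_signal : a1;   // multiplexer select
    int a2  = mux + in.z;                        // second type A resource
    return a2 * a1;                              // type B resource
}
```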
A function “func” can be defined and mapped onto the processing element resources described above.
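A purely hypothetical illustration, assuming “func” mixes the type A (addition) and type B (multiplication) operations discussed above:

```cpp
// Hypothetical example only; the original "func" is an assumption here.
// A short dataflow function built from additions (type A resources) and a
// multiplication (type B resource).
int func(int a, int b, int c, int d) {
    int s1 = a + b;    // type A
    int s2 = c + d;    // type A
    return s1 * s2;    // type B
}
```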
Alternatively, rather than increasing throughput, the frequency and VDD (supply voltage) may be reduced to save power, as described more fully below. The effects of increased delays (impacting the operating frequency) and interconnect capacitances (affecting power consumption) due to resource borrowing can be weighed against the resulting increased adaptability and performance.
Illustrative target applications perform on-demand execution in an environment that provides a dynamic workload, which can be characterized by a mean arrival rate λ of tasks and by the pattern in which tasks arrive. In one embodiment, it can be assumed that the random inter-arrival times are a sequence of independent, identically distributed random variables that can be modeled as an exponential distribution. Such a distribution has the well-known Markov or memoryless property, which states that if time t has elapsed with no arrivals, then the distribution of further waiting time is the same as it would be if no waiting time had passed. That is, the system does not remember that t time units have produced no arrivals. This is the case for network processors where the arrival of packets is independent. It can be assumed that the workload will continue to have a mean arrival rate λ for a long enough time (e.g., order of milliseconds) for the system to achieve a steady state.
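Under this assumption, a synthetic workload can be generated by drawing independent exponential inter-arrival gaps. A minimal C++ sketch (the function name and parameters are illustrative):

```cpp
#include <random>
#include <vector>

// Generate task arrival times with exponential (memoryless) inter-arrival
// gaps at mean arrival rate lambda, matching the workload model above.
std::vector<double> arrival_times(double lambda, int n_tasks, unsigned seed) {
    std::mt19937 gen(seed);
    std::exponential_distribution<double> gap(lambda);  // mean gap = 1/lambda
    std::vector<double> times;
    double now = 0.0;
    for (int i = 0; i < n_tasks; ++i) {
        now += gap(gen);   // memoryless: each gap is independent of history
        times.push_back(now);
    }
    return times;
}
```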
A service time within each processing element can also be modeled as an exponential distribution, for example. This implies that the service time remaining to complete a customer service is independent of the service already provided. The processing element service time will be referred to as Ws, and the average service rate is 1/Ws.
Since the exemplary architectural model includes c independent processing elements (PEs), one can adopt a system model based on the M/M/c queuing system, which is well known to one of ordinary skill in the art. M/M/c queuing systems assume random exponential inter-arrival and service times with c identical servers. Note that the queuing model can also be developed for other distributions for both the workload and the service time. For example, one might choose to model bursty workloads, or use a hyperexponential distribution for the service time if a large variance exists relative to the mean. It will be readily appreciated by one of ordinary skill in the art that a variety of queuing systems and models can be used without departing from the invention.
In one particular embodiment, the queuing parameters are as follows: λ, the mean task arrival rate; Ws, the average service time of a processing element; c, the number of processing elements (servers); ρ, the server utilization; W, the average wait time in the system; Wq, the average time a task waits in the queue; Ls, the expected steady state number of tasks being serviced; and Lq, the expected steady state number of tasks in the queue.
One relevant metric in designing and evaluating ASC architectures is the server utilization ρ, which indicates how efficiently the hardware is being used. The server utilization ρ can be calculated as set forth below in Equation 1:

$$\rho = \frac{\lambda W_s}{c} = \frac{L_s}{c} \qquad \text{Eq. (1)}$$
Another relevant metric is the average wait time in the system W, which is inversely proportional to the system throughput. The average system wait time W is given by Equation 2 below:

$$W = W_s + W_q \qquad \text{Eq. (2)}$$
It is understood that the average steady state time Ws a task spends in the processing element is dependent on the processing element implementation. Recall that it is an average steady state response which takes into account the data dependencies within each task. The average steady state time Wq a task spends in the queue is somewhat more complicated to calculate since it is dependent on the expected steady state number Lq of tasks queued and waiting to be serviced and the expected steady state number Ls of tasks being serviced. The expected steady state number Ls of tasks being serviced can be determined as set forth below in Equation 3:
$$L_s = \lambda W_s \qquad \text{Eq. (3)}$$
If the expected steady state number Ls of tasks being serviced is less than c, the expected steady state wait time in the queue will be zero; however, that is not the general case. The average steady state time Wq can be calculated in accordance with Equation 4:

$$W_q = \frac{L_q}{\lambda} \qquad \text{Eq. (4)}$$
The average length of the queue, or the steady state number of tasks in the queue Lq, is determined by Equation 5:

$$L_q = C[c, L_s]\,\frac{\rho}{1-\rho} \qquad \text{Eq. (5)}$$
C[c,Ls], which is known as Erlang's C formula, specifies the probability that an arriving task must queue for processing element service. Erlang's C formula can be expressed as set forth below in Equation 6:

$$C[c, L_s] = \frac{\dfrac{L_s^{\,c}}{c!\,(1-\rho)}}{\displaystyle\sum_{k=0}^{c-1}\frac{L_s^{\,k}}{k!} \;+\; \frac{L_s^{\,c}}{c!\,(1-\rho)}} \qquad \text{Eq. (6)}$$
The utilization ρ is restricted to be less than 1 since otherwise the average length of the queue Lq, and thus Wq, will be either infinite or negative; the system simply becomes overloaded.
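Equations 1 through 6 can be evaluated directly in a few lines of code. The following C++ sketch (the function names are illustrative) computes the steady state system wait time for a given arrival rate, service time, and number of processing elements:

```cpp
#include <stdexcept>

// Erlang's C formula: probability that an arriving task must queue, with
// c servers and offered load a = Ls = lambda * Ws (Eqs. 3 and 6).
double erlang_c(int c, double a) {
    double rho = a / c;                        // utilization (Eq. 1)
    if (rho >= 1.0) throw std::runtime_error("overloaded: rho must be < 1");
    double sum = 0.0, term = 1.0;              // term holds a^k / k!
    for (int k = 0; k < c; ++k) {
        sum += term;
        term *= a / (k + 1);
    }
    double last = term / (1.0 - rho);          // a^c / (c! (1 - rho))
    return last / (sum + last);
}

// Average steady state system wait time W = Ws + Wq (Eqs. 2, 4, 5).
double system_wait(double lambda, double Ws, int c) {
    double a   = lambda * Ws;                  // Ls, tasks in service (Eq. 3)
    double rho = a / c;
    double Lq  = erlang_c(c, a) * rho / (1.0 - rho);  // Eq. (5)
    double Wq  = Lq / lambda;                  // Eq. (4)
    return Ws + Wq;                            // Eq. (2)
}
```

For example, system_wait(0.5, 1.0, 2) evaluates the M/M/2 system wait time for λ = 0.5 and Ws = 1.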
In one aspect of the invention, an optimal ASC architecture is designed under different workload arrival rates λ. Given a fixed area and several possible processing element implementations that vary in their implementation area and, thus, their performance (Ws), it is desirable to determine the optimal number of processing elements that will provide the best performance under different workload conditions.
Given a set of arrival rates Λ = {λ1, λ2, . . . , λk}, a fixed area A, and several possible implementations for a processing element, it is desired to find the number of processing elements c that maximizes throughput across all given λ ∈ Λ.
In one particular embodiment, the optimal number of processing elements c can be found as follows. Because the total area is fixed and the area for each possible processing element implementation is known, one can determine each c that is possible. For each arrival rate λi ∈ Λ, the system wait time W can be calculated using the equations presented above. The task is then to choose the number c of processing elements that minimizes the overall system wait time W.
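Combining the wait time computation above with the area constraint gives a simple exhaustive search. In the following sketch, the struct and function names are illustrative, and summing W over the arrival rates is one reasonable way to score the "overall" system wait time:

```cpp
#include <cstddef>
#include <vector>

double system_wait(double lambda, double Ws, int c);  // sketched above

struct PEDesign { double area; double Ws; };  // per-element area and service time

// Pick the design whose c = floor(A / area) processing elements minimize
// the system wait time summed over all arrival rates in Lambda.
// Returns the index of the chosen design, or -1 if none is feasible.
int best_design(const std::vector<PEDesign>& designs, double total_area,
                const std::vector<double>& Lambda) {
    int best = -1;
    double best_score = 0.0;
    for (std::size_t i = 0; i < designs.size(); ++i) {
        int c = static_cast<int>(total_area / designs[i].area);  // floor(A / Ai)
        if (c < 1) continue;
        double score = 0.0;
        bool feasible = true;
        for (double lambda : Lambda) {
            if (lambda * designs[i].Ws >= c) { feasible = false; break; }  // rho >= 1
            score += system_wait(lambda, designs[i].Ws, c);
        }
        if (feasible && (best < 0 || score < best_score)) {
            best_score = score;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```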
As an example, consider a total area of 100 square units and three different processing element designs (PEa, PEb, PEc) that trade performance for area, each having different area and service time tradeoffs as shown below in Table 1. Given the fixed area of 100 square units, the number of possible processing elements is calculated for each design. First,

$$c_i = \left\lfloor \frac{A}{A_i} \right\rfloor$$

is determined, where $A_i$ is the implementation area for PE design i. The results are shown on the last row of Table 1.
As described above, hardware threading refers to the ability to borrow unused resources from idle processing elements and thread them together to obtain a more optimized architecture. Processing elements can operate either independently without threading, or with “n-threading” when resources from n processing elements are collected to create a threaded processing element. A dynamic scheduler, which can reside in hardware or in software within the master processor, can determine when processing elements are threaded together.
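Conceptually, the run-time decision can be as simple as checking whether a neighboring processing element is idle. The following C++ fragment is a hypothetical sketch of such a policy, not the patent's scheduler:

```cpp
// Hypothetical run-time policy: dispatch a task in 2-threaded mode when an
// idle neighboring processing element can lend its resources.
struct ProcessingElement { bool busy = false; };

enum class Mode { Independent, TwoThreaded };

Mode dispatch(ProcessingElement& pe, ProcessingElement& neighbor) {
    pe.busy = true;
    if (!neighbor.busy) {           // borrow the idle neighbor's resources
        neighbor.busy = true;       // mark them lent out for this task
        return Mode::TwoThreaded;   // shorter service time Ws
    }
    return Mode::Independent;       // run without threading
}
```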
In step 304, the HT schedule is translated to Register Transfer Level (RTL) HDL (Hardware Description Language) code, which can then be synthesized using a tool, such as Design Compiler from Synopsys Inc., in step 306. Before describing the details of the algorithm, an example is provided below.
The following code implements a function f.
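A hypothetical C++ stand-in, assuming f consists of four additions, a multiplication, and a comparison, consistent with the schedule states (+1 through +4, *1, <1) discussed below:

```cpp
// Hypothetical stand-in for f; the original listing is an assumption here.
int f(int a, int b, int c, int d) {
    int t1 = a + b;     // +1
    int t2 = c + d;     // +2
    int t3 = t1 + t2;   // +3
    int t4 = t3 + d;    // +4
    int m1 = t1 * t2;   // *1
    if (t4 < m1)        // <1 (conditional)
        return m1;
    return t4;
}
```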
Resource constraints can include, for example, only one adder and one multiplier being available in one clock cycle. An add and a multiply operation can be scheduled at the same time, for example, but cannot be chained together. The corresponding ASAP (as-soon-as-possible) schedule is shown in the accompanying figure.
In one embodiment, the HTScheduler, which is described in detail below, uses an ASAP schedule to produce a new HT schedule. The threading of two processing elements is conceptually shown in the accompanying figures.
For example, the dashed edge from +3 to < indicates that, when hardware threading is applied, state <1 becomes the next state after super state (+3,+4), skipping state +4 in between. The resulting two-threaded (n-threaded, where n=2) schedule is outlined by the new states and the dashed edges. The threaded version reduces the number of states along the worst case path to six. The reduction in states directly corresponds to a reduction in task execution time, and results in an overall lower service time for the processing elements.
The inventive synthesis algorithm, an exemplary embodiment of which is described below, transforms a non-threaded schedule into a hardware threaded schedule.
In one particular embodiment, the following variables are used: CurrentState, the state currently being examined; NextState, a candidate state within the lookahead window; w, the size of that window; and a Borrowed flag marking states whose operations have been combined into an earlier state.
The CurrentState is initially set to the first state in the non-threaded schedule (see line 1). All states in the unthreaded schedule are traversed until the end of the schedule is reached (lines 2, 28). If the CurrentState is a conditional state, the scheduler is recursively called with a subschedule containing all the states along each conditional path (lines 3-5). Once a state is examined (line 6), it is marked as already scheduled (line 7) and the next states within a window of size w are examined (lines 8-18).
The IsSchedulable routine (line 14) checks for resource constraints and for control and data dependencies between the CurrentState and NextState. If the NextState can be scheduled with the CurrentState, then NextState, as well as its edge, are added to the respective queues. The NextState is marked as borrowed and operations of CurrentState and NextState are combined (line 18).
Once all of the states in the window are examined, the edges for the threaded schedule are created (lines 19, 27). The new edge may be created between the CurrentState and the successor state (line 27), or one or more states is skipped due to combining operations within the skipped states with states earlier in the schedule (lines 20-25). In the latter case, the edges and states are skipped one at a time until all Borrowed states are skipped.
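The control flow just described can be summarized in a C++ sketch. The data structures, the schedulability test, and the mapping to the cited line numbers are all illustrative assumptions rather than the patent's listing:

```cpp
#include <cstddef>
#include <vector>

// States of the non-threaded schedule, in order.
struct State {
    std::vector<int> ops;                      // operations bound to this state
    bool conditional = false;
    std::vector<std::vector<State>> branches;  // subschedules of a conditional
    bool scheduled = false;
    bool borrowed  = false;                    // merged into an earlier state
};

// Placeholder schedulability test: the real routine checks resource
// constraints (e.g., one adder and one multiplier per cycle) and control
// and data dependencies between CurrentState and NextState (line 14).
bool IsSchedulable(const State& cur, const State& next) {
    return !next.conditional && cur.ops.size() + next.ops.size() <= 2;
}

void HTScheduler(std::vector<State>& sched, int w) {
    for (std::size_t i = 0; i < sched.size(); ++i) {  // traverse states (lines 1-2, 28)
        State& cur = sched[i];
        if (cur.borrowed) continue;                   // already merged away
        if (cur.conditional)                          // recurse on each path (lines 3-5)
            for (auto& branch : cur.branches)
                HTScheduler(branch, w);
        cur.scheduled = true;                         // mark as scheduled (lines 6-7)
        // Examine the next states within a window of size w (lines 8-18).
        for (std::size_t j = i + 1;
             j < sched.size() && j <= i + static_cast<std::size_t>(w); ++j) {
            State& next = sched[j];
            if (!next.borrowed && IsSchedulable(cur, next)) {
                cur.ops.insert(cur.ops.end(),         // combine operations (line 18)
                               next.ops.begin(), next.ops.end());
                next.borrowed = true;
            }
        }
        // Edge creation (lines 19-27) would then link cur directly to its
        // next non-borrowed successor, skipping Borrowed states one at a time.
    }
}
```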
While the illustrative algorithm is implemented in the C++ programming language and the HT schedule translation is implemented in synthesizable Verilog, it is understood that other suitable programming languages and synthesis tools will be readily apparent to one of ordinary skill in the art.
As is known in the art, the Discrete Cosine Transform (DCT) is widely used in image compression techniques, such as JPEG (Joint Photographic Experts Group) and MPEG (Moving Picture Experts Group). As described below, the DCT can be used to show the applicability of hardware threading in accordance with the present invention to a multimedia domain. Equation 7 below implements the DCT:

$$X(k) = \sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{\pi(2n+1)k}{2N}\right), \quad k = 0, 1, \ldots, N-1 \qquad \text{Eq. (7)}$$
where N is the sequence length, x(n) is the discrete-time signal, and X(k) is the resulting spectral content.
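Written out as software, Equation 7 is a doubly nested loop. A minimal C++ sketch, assuming the unnormalized form given above:

```cpp
#include <cmath>
#include <vector>

// Direct evaluation of Eq. (7): an outer loop over the spectral index k and
// an inner loop over the time index n. The inner loop dominates the schedule,
// which is why threading inner-loop states yields the largest gains.
std::vector<double> dct(const std::vector<double>& x) {
    const double pi = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<double> X(N, 0.0);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::cos(pi * (2.0 * n + 1.0) * k / (2.0 * N));
    return X;
}
```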
The system first generates a DCT CDFG (control data flow graph), which contains two conditional nodes, as shown in the accompanying figure.
The resulting optimal threaded schedule is shown in the accompanying figure.
The resulting threaded schedules were then transformed into Verilog HDL and synthesized using, for example, Synopsys Design Compiler to generate the results shown in Table 2. The area overhead was due to additional multiplexers needed to select between the inputs to some of the states. The clock period increase was caused by: (a) the introduction of multiplexers along the critical paths, and (b) the added delay due to the additional interconnect associated with the multiplexers and due to the added fan-out load on some of the signals. For this example, a window size of one (w=1) did not increase the clock period from the non-threaded implementation (w=0), because the inserted multiplexer was not on a critical path. In one embodiment, to model these loads the designs were synthesized with a wire load model having a load slope of 0.311. The slope was used to estimate the loads associated with the wire lengths. The wire load model was used to derive a minimum clock period for the designs.
The service times for the non-threaded and threaded schedules were simulated and compared. In one embodiment, an abstract cycle-based Verilog model was used for the simulation and comparison. The system included tasks arriving independently of one another, with the time spent in the queue (Wq) and the processing time (Ws) recorded.
As the arrival rate changed, the system wait times varied as shown in the accompanying plots.
In another aspect of the invention, a threaded mode of operation can be used to lower average system power while maintaining a constant system wait time similar to that of a non-threaded mode of operation. Power can be reduced by decreasing the clock frequency and/or the supply voltage, since the dynamic power is given by Equation 8 below:
$$P_{dyn} = C_L V_{DD}^2 f \qquad \text{Eq. (8)}$$
where $V_{DD}$ is the supply voltage, $f$ is the clock frequency, and $C_L$ is the load capacitance.
The following describes an analysis comparing the non-threaded and threaded architectures, using window sizes of zero and two, respectively. The power reduction is based on Ws, with Wq held constant between the two architectures. By allowing the average steady state time Ws that a task spends in a threaded processing element to be equivalent to that of the non-threaded architecture, the frequency of the threaded architecture can be scaled down. Consider, for example, the lowest arrival rate λ in the accompanying plot.
The supply voltage can also be scaled down since the frequency is lower than in the original case, which allows the normalized delay to be increased. The normalized delay here is 1/f, which equates to 1.15. If the original supply voltage was 5 V, then the supply voltage can be reduced to approximately 3.75 V, given a normalized delay curve. The voltage and frequency scaling for this example yields $P_{new} = 0.49\,P_{old}$. The threaded architecture thus allows a substantial savings in power for the same system wait time as the non-threaded architecture.
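As a check on the quoted figure, scaling both factors of Equation 8 gives:

$$\frac{P_{new}}{P_{old}} = \left(\frac{V_{new}}{V_{old}}\right)^{2}\frac{f_{new}}{f_{old}} = \left(\frac{3.75}{5}\right)^{2}\times\frac{1}{1.15} \approx 0.56 \times 0.87 \approx 0.49$$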
As is known in the art, the Discrete Fourier Series (DFS) can represent periodic discrete-time signals, and is often used for calculations involving linear time-invariant systems. The DFS was investigated to show the applicability of the inventive hardware threading to a multimedia domain. The coefficients of the DFS were calculated according to Equation 9 below:

$$c_k = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\,e^{-j2\pi kn/N} \qquad \text{Eq. (9)}$$
where N is the sequence length, x(n) is the discrete-time signal, and ck is the resulting discrete-time Fourier-series coefficient.
The synthesis and investigation procedure followed those performed for the DCT above. The number of states versus the window size can be seen in Table 3 below. After a window size of one, the operation level parallelism was exhausted in the non-threaded schedule.
Table 3 shows the area and timing overhead associated with hardware threading for the DFS example. As with the DCT, timing delays were associated with the additional interconnect. With a state reduction of only one, the hardware threading overheads were minimal. The two combined states were inner loop states, and threading states within an inner loop yields the greatest performance increase.
The resulting threaded and non-threaded architectures were placed in a simulation environment and the system arrival rates were varied. The system responses are shown in the accompanying plots.
Similarly, one can maintain the service times while reducing the frequency and $V_{DD}$ to lower power consumption, as in the DCT example above.
In another example, Discrete Fourier Transform (DFT) processing is evaluated with hardware threading. The DFT examined here is defined in Equation 10 below:

$$X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi kn/N}, \quad k = 0, 1, \ldots, N-1 \qquad \text{Eq. (10)}$$

where N is the sequence length, x(n) is the discrete-time signal, and X(k) is the resulting spectral content.
The synthesis and simulation followed the same guidelines set forth by the previous examples. The number of states versus window size is shown in Table 4 below.

Table 4. Number of states in DFT non-threaded and threaded schedules, and area and timing estimates.
Table 4 shows that an optimal threaded schedule was reached after a window size of two. Like the previous examples, the state reduction occurred in the innermost loop of the schedule, which allowed for the greatest performance increase. Table 4 also shows area and timing results in addition to the number of states. The three different hardware configurations were simulated in an environment model and the resulting system wait times were monitored.
The Fast Fourier Transform (FFT) is yet another application in signal processing that can benefit from the inventive hardware threading techniques described herein. Like the other examples, the FFT was considered for its applicability to a broad range of applications and its popularity. The FFT used in this example was in the radix-2 decimation-in-time form set forth below in Equation 11:

$$X(k) = \sum_{n=0}^{N/2-1} x(2n)\,e^{-j2\pi(2n)k/N} \;+\; e^{-j2\pi k/N}\sum_{n=0}^{N/2-1} x(2n+1)\,e^{-j2\pi(2n)k/N} \qquad \text{Eq. (11)}$$
where N is the sequence length, x(n) is the discrete-time signal, and X(k) is the resulting spectral content. As in the other examples, the FFT was synthesized and simulated in the environmental model. The states versus window size for the FFT are shown in Table 5 below.
Table 5. Number of states in FFT non-threaded and threaded schedules, and area and timing estimates.
Like the DFS, all of the non-threaded schedule's operation level parallelism was exhausted with a window size of one.
The non-threaded and 2-threaded architectures were simulated, and the resulting system wait times are shown in the accompanying plots.
The present invention provides a technique for high-level synthesis that addresses the synthesis of dynamic datapaths based on dynamic workloads. The inventive hardware threading technique requires little overhead while creating an adaptable system that allows a range of power and performance capabilities.
One skilled in the art will appreciate further features and advantages of the invention based on the above-described embodiments. Accordingly, the invention is not to be limited by what has been particularly shown and described, except as indicated by the appended claims. All publications and references cited herein are expressly incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/US04/10320 | 4/2/2004 | WO | | 10/3/2005

Number | Date | Country
---|---|---
60/460,080 | Apr. 2003 | US