1. Technical Field
The present invention relates generally to an improved data processing system and, in particular, to a data processing system and method for scheduling tasks to be executed by processors. More particularly, the present invention provides a mechanism for scheduling tasks to processors of different frequencies that can adequately meet the tasks' computational needs. Still more particularly, the present invention provides a mechanism for scheduling a task to processors of different frequencies in a manner that optimizes utilization of a multi-processor system's processing capacity.
2. Description of Related Art
Multiple processor systems are well known in the art. In a multiple processor system, a plurality of processes may be run and each process may be executed by one or more of a plurality of processors. Each process is executed as one or more tasks which may be processed concurrently. The tasks may be queued for each of the processors of the multiple processor system before they are executed by a processor.
Applications executed on high-end processors typically make varying load demands over time. A single application may have many different phases during its execution lifetime, and workload mixes consisting of multiple applications typically exhibit interleaved phases. Application execution phases may generally be characterized as memory-intensive and CPU-intensive.
Contemporary processors provide resources that are underutilized during memory-intensive phases and may consume increasingly large amounts of power while producing little incremental gain in performance, a phenomenon known as performance saturation. Executing performance-saturated tasks on a processor at frequencies beyond a certain speed results in little, if any, gain in task execution performance, yet causes increased power consumption.
It is known that a single application is composed of different phases. Modern processor design and research has attempted to exploit variability in processor workloads. Examining an application at different granularities exposes different types of variable behavior which can be exploited to reduce power consumption. Long-lived phases can be detected and exploited by the operating system. Frequency and voltage scaling are mechanisms used by operating systems to reduce power when running variable workloads.
Various mechanisms have been developed that utilize frequency and voltage scaling with heterogeneous processors in an attempt to control the average or the maximum power consumed by data processing systems. For example, LongRun by Transmeta Corporation of Santa Clara, Calif., and Demand Based Switching by Intel Corporation of Santa Clara, Calif., both respond to changes in processor demand but do so on an application-unaware basis. In both systems, an increase in CPU utilization leads to an increase in frequency and voltage, while a decrease in utilization leads to a corresponding decrease in frequency and voltage. Neither system makes any use of information regarding how efficiently the workload uses the processor or how the workload affects memory behavior. Rather, these systems rely on simple metrics, such as non-halted cycle counts during an interval of time.
Other work has applied dynamic frequency and voltage scaling in the Linux operating system, focusing on average power and total energy consumption. For example, laptop applications and the interaction between the system and the user have been examined to determine the slack due to processor over-provisioning. These methodologies implement frequency and voltage scaling to reduce power, consuming the slack by running the computation more slowly. Other efforts have extended these systems for deployment to web server farms. For example, the use of request batching to gain larger reductions in power during periods of low demand has been addressed.
However, none of the previous approaches provide a mechanism for responding to changes in memory subsystem demands rather than changes in CPU utilization metrics. Thus, it would be advantageous to utilize simple metrics, such as memory hierarchy performance counters, to identify the processor frequency best suited to execution of an application phase. It would be further advantageous to provide a mechanism for identifying ideal processor frequencies for execution of an application phase in a multiprocessor system. It would also be advantageous to provide a mechanism for minimizing the performance penalties that result from executing memory-intensive application phases at slower, less power-consumptive processor frequencies.
The present invention provides a method, computer program product, and data processing system for optimizing task throughput in a multi-processor system. A performance metric is calculated based on performance counters measuring characteristics of a task executed at one of a plurality of processor frequencies available in the multi-processor system. The characteristics measured by the performance counters indicate activity in the processor as well as memory activity. The performance metric provides a means of using data measured at one available frequency to predict performance at another processor frequency available in the multi-processor system. Performance loss minimization is used to assign a particular task to a particular frequency. Additionally, the present invention provides a mechanism for priority load balancing of tasks in a manner that minimizes the cumulative performance loss incurred by execution of all tasks in the system.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
With reference now to the figures, an exemplary multi-processor (MP) system 100 in which the present invention may be implemented is depicted.
Dispatcher 151 may be implemented as software instructions run on processors 120-123 of MP system 100. Dispatcher 151 allocates tasks to queues 180-183, which are respectively assigned to processors 120-123 and maintained in memory subsystem 130. Dispatcher 151 allocates tasks to particular queues at the direction of task-to-frequency scheduler (TFS) 150. TFS 150 facilitates load balancing of CPUs 120-123 by monitoring queue loads and directing dispatch of new tasks accordingly. Additionally, TFS 150 interfaces with (or alternatively includes) a performance predictor 152, which facilitates selection of a processor frequency for a given task in multi-processor system 100. Predictor 152 evaluates task performance at one of a plurality of system processor frequencies and estimates the task performance and performance loss that would result from executing the task at one or more other system processor frequencies. TFS 150 schedules tasks to particular frequencies based on the evaluation made by predictor 152, in accordance with embodiments of the invention and as discussed more fully herein below.
In the illustrative example, MP system 100 has CPUs 120-123 that are set to operate at respective frequencies f1-f4. To facilitate an understanding of the invention, assume herein that the operational frequency f1 of CPU 120 is the slowest of the frequencies f1-f4 and that the CPU frequencies increase from f1 to f4, that is, CPU 123 operates at the highest processor frequency f4 available in MP system 100. The set of available CPU frequencies f1-f4 in MP system 100 is referred to herein as the system frequency set F.
MP system 100 may be any type of system having a plurality of multi-processor modules adapted to run at different frequencies and voltages. As used herein, the term “processor” refers to either a central processing unit or a thread processing core of an SMT processor. Each of CPUs 120-123 contains a respective level one (L1) cache 110-113. CPUs 120-123 are coupled in pairs to respective level two (L2) caches 140 and 141. Each of L2 caches 140 and 141 may in turn be coupled to an optional level three (L3) cache 160. The caching design described here is one of many possible such designs and is used solely for illustrative purposes. The lowest memory level for MP system 100 is system memory 130. CPUs 120-123, L1 caches 110-113, L2 caches 140 and 141, and L3 cache 160 are coupled to system memory 130 via an interconnect, such as bus 170. Tasks are selected for placement in one of queues 180-183 based on their assignment to one of processors 120-123 by TFS 150.
The present invention exploits observed changes in demands on caches 110-113, 140-141, and 160 and on memory subsystem 130 to minimize power consumption and performance loss in MP system 100. Task-to-frequency scheduling is performed continually and covers all forms of memory usage, ranging from tasks in which the processor must wait much of the time for memory operations to complete to tasks in which the processor waits for memory only occasionally. When the processor waits for a cache or memory operation to complete, the processor is said to “stall,” and such events are known generically as “memory stalls.” TFS may replace the task scheduler of an operating system or may supplement an existing task scheduler. During CPU-intensive phases, tasks are assigned to a processor running at full voltage and frequency, while tasks in memory-intensive phases are assigned to a processor running at a slower frequency and thus, possibly, a lower voltage. Processors 120-123 are preferably implemented as respective instances of a single processor generation that may be run at different frequencies and voltages. For example, processors 120-123 may each be implemented as PowerPC processors available from International Business Machines Corporation of Armonk, N.Y.
In accordance with an embodiment of the present invention, the TFS runs a modeling routine, or predictor 152, that facilitates scheduling of tasks to a processor core of a particular frequency and voltage by identifying an ideal frequency at which the task may be performed and predicting performance of the task if run at a lower frequency. A performance loss of the task at the lower frequency and voltage is evaluated to determine the system processor frequency to which the task should be scheduled. That is, the predictor routine evaluates predicted performance of a task and facilitates selection of one of a set of heterogeneous processors to execute the task by determining a frequency based on the memory demands of the task. Scheduler 150 uses the evaluation of the predictor and load balancing requirements to generate an appropriate allocation of tasks to processor frequencies. The TFS routine is based on the premise that only a limited benefit in task execution is realized by increasing processor performance beyond a particular, task- and phase-specific point. The point of processor capacity at which little, if any, increase in task execution performance is realized with an increase in processor performance capacity is referred to herein as the “performance saturation point.” A task executed above its saturation point is referred to herein as a “performance saturated task.” Performance saturation arises because memory is generally much slower than the processor. Thus, at some point, the speed of task execution is bounded by the memory speed. The ratio of CPU-intensive to memory-intensive work in a task determines the saturation point.
In accordance with a preferred embodiment of the present invention, process tasks, such as threads, are scheduled to particular processors by identifying a processor that will result in minimal performance loss for a particular set of fixed processor frequencies available, i.e., the system frequency set F, versus running the tasks on a set of processors whose frequencies are all the maximum frequency available in F.
To this end, an instruction per cycle (IPC) predictor is implemented in predictor 152 that uses a measured IPC of a task at one frequency and performance counters to predict an IPC value at any other frequency. TFS 150 then evaluates the benefit of allocating tasks to processors available in the system. A processor of the MP system is selected based on a metric of minimum performance loss. Additionally, the predictor accommodates changes in application memory and CPU intensity, that is, phase changes. Moreover, phase changes are detected to decide when to repartition the set of tasks, that is, when to change the processor to which associated tasks are allocated.
The predictor comprises an IPC model that includes a frequency-dependent and a frequency-independent component. Equation 1 defines an exemplary IPC predictor calculation that may be implemented in predictor 152:

IPC(f) = Instructions/(Cinst + Cstall)    (equation 1)

where Cstall is the number of cycles having stalls, e.g., branch stalls, pipeline stalls, memory stalls, and the like, and Cinst is the number of cycles in which instructions are executed. By utilizing readily available performance counters, e.g., the number of memory hierarchy references, associated latencies, and the like, equation 1 can be implemented as defined in equation 2:

IPC(f) = Instructions/(Instructions/α + Cbranch + Cpipeline + Cother + NL2*TL2*f + NL3*TL3*f + Nmem*Tmem*f)    (equation 2)

where: α is the idealized IPC of a perfect machine with infinite L1 caches and no stalls;

Instructions is the number of instructions completed;

Cbranch is the number of stall cycles attributable to branch stalls;

Cpipeline is the number of stall cycles attributable to pipeline stalls;

CL2 is the number of stall cycles attributable to L2 cache references, estimated as NL2*TL2*f;

CL3 is the number of stall cycles attributable to L3 cache references, estimated as NL3*TL3*f;

Cmem is the number of stall cycles attributable to system memory references, estimated as Nmem*Tmem*f;

Cother is the number of stall cycles attributable to other causes;

NL2 is the number of L2 cache references;

TL2 is the latency to L2 cache;

NL3 is the number of L3 cache references;

TL3 is the latency to L3 cache;

Nmem is the number of references to the system memory;

Tmem is the latency to the system memory; and

f is the frequency at which the execution of the process tasks is evaluated.

Because the latencies TL2, TL3, and Tmem are expressed in units of time, multiplying by the frequency f converts the memory stall terms into cycle counts; these terms supply the frequency-dependent component of the model.
The α value takes into account the instruction-level parallelism of a program and the hardware resources available to extract it. Since α cannot always be determined accurately, the predictor may use a heuristic to set it. One possible heuristic is to take α to be 1 for programs with an IPC less than 1, and to take α to be the program's IPC at the maximum frequency for programs with an IPC greater than 1.
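For illustration only, the following sketch shows how the prediction of equation 2 and the α heuristic might be expressed in code. It is a minimal sketch under stated assumptions, not the invention's implementation: the PerfCounters structure, the counter names, and the example latency values are hypothetical stand-ins for whatever performance monitoring facility a given platform provides.

    # Illustrative sketch of the IPC predictor of equation 2.
    from dataclasses import dataclass

    @dataclass
    class PerfCounters:
        instructions: int   # instructions completed in the sample interval
        c_branch: int       # branch stall cycles
        c_pipeline: int     # pipeline stall cycles
        c_other: int        # other stall cycles
        n_l2: int           # L2 cache references
        n_l3: int           # L3 cache references
        n_mem: int          # system memory references

    # Latencies in seconds, so latency * frequency yields cycles (assumed values).
    T_L2, T_L3, T_MEM = 10e-9, 40e-9, 100e-9

    def alpha_heuristic(ipc_at_fmax: float) -> float:
        """One possible heuristic for the idealized IPC, alpha."""
        return 1.0 if ipc_at_fmax < 1.0 else ipc_at_fmax

    def predict_ipc(c: PerfCounters, alpha: float, f: float) -> float:
        """Predict IPC at frequency f (Hz) per equation 2."""
        stall_time = c.n_l2 * T_L2 + c.n_l3 * T_L3 + c.n_mem * T_MEM
        cycles = (c.instructions / alpha + c.c_branch + c.c_pipeline
                  + c.c_other + stall_time * f)
        return c.instructions / cycles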
Memory stall cycles can be measured directly via performance counters on some data processing systems and can be estimated via reference counts and memory latencies on other data processing systems. This embodiment uses estimation via reference counts and memory latencies; however, this is for illustrative purposes only.
The IPC, or another performance metric derived therefrom, is calculated for a task at a particular frequency f in the system frequency set F by running a given task on one of the system processors. The frequency used need not be f since the invention allows the calculation of projected values of the performance metric at any frequency given performance data collected at a single frequency. In particular, the data may be used with equation 2 to predict the performance metric(s) at any frequency in the frequency set F. Calculation of a performance metric based on performance counters obtained by running a task on a system processor is a process performed at the end of a scheduling quantum. A scheduling quantum is the maximum duration of time a task may execute on a processor before being required to give control back to the operating system.
Thus, the IPC predictor estimates the IPC for a frequency f based on the number of cycles the processor was stalled by memory accesses. If the data processing system does not support direct performance counter measurement of memory access stalls, then the IPC predictor estimates the IPC for a frequency f based on the number Nx of memory references of each type x and the corresponding latency Tx consumed by each such event. The predictor equation described above assumes constant values for Tx, while in reality the time Tx for an event may vary. The error introduced by assuming the Tx values are constant is small and provides the predictor with an acceptable approximation of the IPC.
The error introduced by assuming constant Tx values may be minimized by a number of methods. For example, an additional linearity assumption may be made to determine memory latencies empirically. However, such a technique requires the measurement of the IPC at two different frequencies. To this end, the IPC equation may be written as a linear equation of two variables of the form af+b=1/IPC. Thus, by taking two measurements of IPC at different frequencies, two instances of the IPC equation may be solved for a and b which are then utilized to calculate the performance predictions at other frequencies. Other methods may be implemented for reducing the error introduced by assuming constant memory latencies.
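As an illustrative sketch of the two-measurement technique described above, the following assumes two IPC samples of the same task phase taken at two different frequencies; the function names are hypothetical.

    # Sketch: calibrate the linear model a*f + b = 1/IPC from two samples.
    def calibrate_linear_cpi(f1: float, ipc1: float, f2: float, ipc2: float):
        """Solve a*f + b = 1/IPC for a and b, given two (frequency, IPC) samples
        at distinct frequencies f1 != f2."""
        a = (1.0 / ipc2 - 1.0 / ipc1) / (f2 - f1)
        b = 1.0 / ipc1 - a * f1
        return a, b

    def predict_ipc_linear(a: float, b: float, f: float) -> float:
        """Predict IPC at frequency f using the calibrated coefficients."""
        return 1.0 / (a * f + b)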
The reciprocal of the IPC may be calculated to provide cycles per instruction (CPI) as defined by equation 3:

CPI(f) = 1/IPC(f) = 1/α + (Cbranch + Cpipeline + Cother)/Instructions + (NL2*TL2 + NL3*TL3 + Nmem*Tmem)*f/Instructions    (equation 3)

The CPI calculation in equation 3 asymptotically approaches a CPU-intensive form and a memory-intensive form. Because the CPI asymptotically approaches these two forms, the two forms can be rewritten as IPC variants as provided by equations 4 and 5:

IPCcpu = Instructions/(Instructions/α + Cbranch + Cpipeline + Cother)    (equation 4)

IPCmem = Instructions/((NL2*TL2 + NL3*TL3 + Nmem*Tmem)*f)    (equation 5)

In the CPU-intensive limit the IPC is independent of frequency, while in the memory-intensive limit the IPC is inversely proportional to frequency.
Accordingly, at any given frequency, the predictor can predict the IPC at another frequency given the number of misses at the various levels in the memory hierarchy as well as the actual time it takes to service a miss. This provides the mechanism for identifying the optimal frequency at which to run a given phase with minimal performance loss. As expected, the more memory-intensive a phase is, as indicated by the memory subsystem performance counters, the more feasible it is to execute the phase at a lower frequency (and voltage) to save power without significantly impacting performance, and the better the phase fits onto a slower processor.
An exemplary throughput performance metric for evaluating the performance of a task at a particular frequency is defined as the product of the IPC calculated at a particular frequency and the frequency itself, as represented by equation 6:

Perf(t, f) = IPC(t, f)*f    (equation 6)

The change in throughput performance resulting from execution of the task phase at a different frequency, g, is determined according to equation 7:

ΔPerf(t, f, g) = (Perf(t, f) − Perf(t, g))/Perf(t, f)    (equation 7)

The throughput performance metric Perf defined by equation 6 is used for evaluating execution performance of a task t at a particular frequency f and provides a numerical value of performance throughput in instructions per second, although other performance metrics may be suitably substituted.
The predicted IPC can be translated into a more meaningful metric for calculating the performance loss of a task t at a particular frequency f. Rather than performing evaluations directly with the predicted IPC, the predicted throughput at frequency f and the processor family's nominal maximum frequency fmax are preferably used. In accordance with a preferred embodiment of the present invention, throughput is used as the metric of performance when attempting to minimize performance loss. Throughput performance is the product of IPC and frequency, while performance loss is the fraction of the performance lost by running the same task at a different frequency. The incentive for using throughput as the metric for performance is that it directly reflects the number of instructions completed per unit of time.
A maximum performance that may be attained for a given task t may then be calculated from equation 6 by calculating the performance at the maximum available system frequency fmax, which in this particular illustrative example is f4. Performance loss estimates are then deduced by calculating a measure of the difference between performance at the maximum system frequency and performance at another system frequency f. For example, equation 8 may be utilized for calculating the performance loss (PerfLoss) incurred by executing a given task t at a frequency f less than the maximum system frequency fmax:

PerfLoss(t, f) = (Perf(t, fmax) − Perf(t, f))/Perf(t, fmax)    (equation 8)
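The following sketch illustrates equations 6 and 8 under the same assumptions as the earlier predictor sketch; it is an illustration, not the claimed implementation.

    # Sketch of the throughput metric (equation 6) and the performance-loss
    # metric (equation 8).
    def perf(ipc_at_f: float, f: float) -> float:
        """Throughput in instructions per second: Perf(t, f) = IPC(t, f) * f."""
        return ipc_at_f * f

    def perf_loss(ipc_at_f: float, f: float,
                  ipc_at_fmax: float, fmax: float) -> float:
        """Fraction of throughput lost by running at f instead of fmax."""
        p_max = perf(ipc_at_fmax, fmax)
        return (p_max - perf(ipc_at_f, f)) / p_max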
Performance loss estimates for individual tasks can be used to minimize the performance loss of the system as a whole. Any particular task may suffer a performance loss, but as long as the system incurs the least possible performance loss under the current frequency constraints, the system performance is acceptable. The total performance loss is the sum of the performance loss of each task scheduled at each frequency. If the possible frequency settings are F = {f1, . . ., fm}, where fm <= fmax, then the total system performance loss may be calculated as defined by equation 9:

TotalPerfLoss = Σ (i = 1 to m) Σ (t ∈ Ti) PerfLoss(t, fi)    (equation 9)

where Ti is the set of tasks scheduled to the processor(s) running at frequency fi.
The minimum performance loss can be found by considering all possible combinations of tasks and frequencies. Each task is considered at each available frequency in the system frequency set F. The partition that produces the minimum performance loss over the entire task set is chosen. However, this approach has a number of shortcomings. The first is that the algorithm necessary to evaluate all possible combinations is computationally prohibitive for use in a kernel-based scheduler. Another problem is that the total performance loss metric described in equation 9 does not take into account the different priorities of tasks in the system. A high priority task may itself suffer a small performance loss, but that lost performance may impact other tasks or system performance parameters. To alleviate this problem, the total performance loss metric can be modified to take into account the priorities of the tasks involved. Equation 10 defines an exemplary performance loss metric calculation that accounts for task priorities and that may be implemented in TFS 150:

WeightedPerfLoss = Σ (i = 1 to m) Σ (t ∈ Ti) p(t, fi)*PerfLoss(t, fi)    (equation 10)
where p(t, fi) defines a task priority for weighting individual task performance losses by task priority. In this particular embodiment, to make equation 10 correct without additional modifications, the p(t, fi) values must all be non-negative and have the property that larger values represent higher priorities. In some data processing systems, such as real-time systems, p(t, fi) may depend on the particular processor frequency and may differ across the frequencies in F.
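As a sketch of equation 10, the following assumes a partition mapping each frequency to its assigned tasks, along with caller-supplied loss and priority functions; all names are hypothetical.

    # Sketch of the priority-weighted total performance loss of equation 10.
    def total_weighted_loss(partition, loss_fn, priority_fn) -> float:
        """Sum p(t, f) * PerfLoss(t, f) over all tasks t assigned to each
        frequency f in the partition."""
        total = 0.0
        for f, tasks in partition.items():
            for t in tasks:
                total += priority_fn(t, f) * loss_fn(t, f)
        return total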
In one embodiment of the present invention, MP system 100 may include processors 120-123 running at set frequencies, and in such a configuration, a system frequency set F is readily identified. In such an implementation, the predictor may utilize a variant of the IPC prediction equation described in equation 3 and the performance loss metric of equation 8 to solve for an ideal frequency, fideal, at which an optimal performance of a task phase execution may be achieved. The ideal frequency fideal may be obtained from performance counter data recorded at the current frequency at which a task is executed. For example, setting the predicted performance loss of equation 8 equal to a maximum acceptable loss ε and solving for the frequency yields an exemplary calculation of the ideal frequency, defined by equation 11:

fideal = Ccpu*(1 − ε)/(Ccpu/fmax + ε*Tmemstall)    (equation 11)

where Tmemstall = NL2*TL2 + NL3*TL3 + Nmem*Tmem is the total time attributable to memory stalls, Ccpu = Instructions/α + Cbranch + Cpipeline + Cother is the frequency-independent cycle count of equation 2, and ε is the maximum acceptable fractional performance loss.
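A sketch of equation 11 follows, under the derivation assumptions stated above (the performance model of equation 2 and the loss metric of equation 8); the argument names are hypothetical.

    # Sketch: solve for the ideal frequency by setting the predicted
    # performance loss of equation 8 equal to a threshold epsilon.
    def ideal_frequency(c_cpu_cycles: float, t_mem_stall: float,
                        fmax: float, epsilon: float) -> float:
        """c_cpu_cycles: frequency-independent cycles (Instructions/alpha +
        Cbranch + Cpipeline + Cother); t_mem_stall: total memory stall time
        in seconds; epsilon: maximum acceptable fractional loss."""
        return (c_cpu_cycles * (1.0 - epsilon) /
                (c_cpu_cycles / fmax + epsilon * t_mem_stall))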
Alternatively, the predictor may calculate the predicted throughput performance at each of the available frequencies of the system frequency set to identify the highest frequency at which a throughput loss would occur.
The particular implementation of the multiprocessor system may be such that it is not possible to schedule all tasks at their desired ideal frequency, due to, for example, power constraints. In such situations, it is necessary to make reasonable scheduling decisions based on additional criteria. It may be preferable that the criteria used for making scheduling decisions include the performance loss and the priority of the tasks under consideration for scheduling, although other considerations may be suitably substituted therefor.
Because the ideal frequency is unlikely to be available in the frequency set of the multiprocessor system, a desired frequency (fdesired) is then identified (step 308). As referred to herein, the desired frequency is the lowest available frequency that is greater than or equal to the ideal frequency. The performance metric Perf is then calculated for the current task at a stepped-down frequency (fdown) (step 310). The stepped-down frequency fdown is one frequency lower than f. In this case, f is fdesired, that is, the frequency at which a performance loss greater than or equal to ε has been predicted to occur if the task is instead performed at fdown. The performance loss is then calculated for the current task at the stepped-down frequency (step 312). The current task (along with the associated performance loss calculated at fdown) is then placed into a data object that is assigned to the desired frequency fdesired (step 314). As referred to herein, a data object into which tasks are placed by TFS is referred to as a bin and may be implemented, for example, as a linked list data structure, although other data structures may be suitably substituted therefor. The bin into which the task was placed is then sorted according to the performance loss at the stepped-down frequency fdown (step 316), for example, from lowest to highest calculated performance loss. The initialization subroutine then evaluates whether additional tasks remain in the task set T for evaluation (step 318). If an additional task remains in the task set, the initialization subroutine selects a new task from the set of remaining tasks (step 320) and returns to calculate the maximum performance and ideal frequency for the remaining tasks according to step 306. If no additional tasks remain in the task set, the initialization subroutine cycle ends.
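The initialization subroutine described above might be sketched as follows. This is an illustrative approximation, not the claimed implementation: the task objects, the ideal-frequency and loss callbacks, and the representation of bins as sorted lists are all assumptions.

    # Sketch of the initialization subroutine: for each task, find the
    # desired frequency (lowest available frequency >= ideal), compute the
    # loss at the stepped-down frequency, and keep each bin sorted by loss.
    from collections import defaultdict

    def initialize_bins(tasks, freq_set, ideal_freq_fn, loss_at_fn):
        freqs = sorted(freq_set)           # f1 < f2 < ... < fmax
        bins = defaultdict(list)           # frequency -> [(loss_at_fdown, task)]
        for t in tasks:
            f_ideal = ideal_freq_fn(t)
            # Desired frequency: lowest available frequency >= ideal, else fmax.
            f_desired = next((f for f in freqs if f >= f_ideal), freqs[-1])
            i = freqs.index(f_desired)
            # At the slowest frequency there is no stepped-down frequency.
            f_down = freqs[i - 1] if i > 0 else f_desired
            loss = loss_at_fn(t, f_down)   # predicted loss if demoted to fdown
            bins[f_desired].append((loss, t))
            bins[f_desired].sort(key=lambda e: e[0])   # lowest loss first
        return bins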
Each task bin 410-413 has available entries to which TFS 450 may write tasks. In the illustrative example, task bin 410 has task bin entries 420-422, task bin 411 has task bin entries 423-425, task bin 412 has task bin entries 426-428, and task bin 413 has task bin entries 429-431. The number of entries in a task bin may be dynamically adjusted as tasks are added to or removed from the bins, as well as added to or removed from the set of tasks, and the task bin entry configuration shown is intended as illustrative only.
Returning again to step 504, if the number of tasks in the task bin that corresponds to the currently selected frequency does not equal the number of bin entries, the scheduling subroutine proceeds to determine whether the number of tasks in the task bin is less than the number of task bin entries (step 506). If the number of tasks in the task bin of the currently selected frequency is less than the number of entries, the scheduling subroutine proceeds to fill the task bin of the currently selected frequency with tasks from the next lower frequency bin, if one is available (step 508). In accordance with a preferred embodiment of the present invention, the tasks removed from the bin of the next lower frequency are selected based on their performance loss. Particularly, tasks are selected for removal from the task bin of the next lower frequency by first selecting the task having the greatest performance loss calculated for the stepped-down frequency and continuing with the task having the next greatest performance loss until a sufficient number of tasks have been moved. In step 508, the goal is to minimize the performance lost at each level. By moving the tasks that suffer the largest potential performance loss to a faster frequency, the chances of large performance losses due to inadequate frequency are reduced. The scheduling subroutine then proceeds to allocate the tasks to the queue corresponding to the currently selected frequency (step 516).
Returning again to step 506, if the number of tasks in the task bin that corresponds to the currently selected frequency is not less than the number of bin entries, the scheduling subroutine proceeds to determine whether the number of tasks in the task bin exceeds the number of task bin entries (step 510). If the number of tasks in the task bin corresponding to the currently selected frequency does not exceed the number of task bin entries, an exception is thrown (step 512), and the scheduling subroutine cycle ends.
Returning again to step 510, if the number of tasks in the task bin corresponding to the currently selected frequency exceeds the number of task bin entries, the scheduling subroutine proceeds to remove a number of tasks from the task bin so that the number of tasks in the task bin corresponding to the currently selected frequency equals the number of task bin entries (step 514). The tasks removed from the currently selected task bin are placed in the task bin at the stepped-down frequency fdown. The tasks removed from the task bin of the currently selected frequency are selected based on the respective calculated performance penalties of the tasks in the currently selected task bin. That is, tasks are removed from the currently selected task bin by selecting the task with the smallest calculated performance loss estimated to be incurred by executing the task at the next lower frequency, fdown. When a sufficient number of tasks have been removed and placed in the task bin of the next lower frequency so that the number of tasks in the task bin corresponding to the currently selected frequency equals the number of task bin entries, the scheduling subroutine proceeds to allocate the tasks in the task bin to the corresponding processor queue according to step 516.
After tasks in the task bin corresponding to the currently selected frequency have been allocated to the corresponding processor queue, the scheduling subroutine then evaluates whether additional task bins remain for scheduling of the tasks (step 518). If additional task bins remain, the scheduling subroutine proceeds to select the next lower frequency (step 520), and then returns to step 504 to evaluate the number of tasks in the newly selected task bin. Otherwise, if no additional task bins remain, the scheduling subroutine cycle then ends.
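The scheduling subroutine of steps 504-520 might be sketched as follows. This is a simplified illustration under stated assumptions: bins hold (loss, task) pairs sorted by ascending loss as in the earlier sketch, capacity gives the number of entries per bin, and re-sorting after promotion is omitted for brevity.

    # Sketch of the scheduling subroutine: walk the bins from the fastest
    # frequency down, pulling the highest-loss tasks up from the next slower
    # bin when a bin is underfull and demoting the lowest-loss tasks when it
    # is overfull, then dispatching each bin to its processor queue.
    def schedule_bins(bins, freqs, capacity, dispatch):
        for i in range(len(freqs) - 1, -1, -1):    # fastest to slowest
            f = freqs[i]
            cur = bins[f]
            if len(cur) < capacity[f] and i > 0:
                lower = bins[freqs[i - 1]]
                # Promote the tasks with the greatest predicted loss first.
                while len(cur) < capacity[f] and lower:
                    cur.append(lower.pop())        # last entry = greatest loss
            elif len(cur) > capacity[f]:
                if i == 0:
                    # No slower bin exists to absorb the overflow (assumed
                    # error handling; the document throws an exception only
                    # in an inconsistent-count case).
                    raise RuntimeError("overflow at slowest frequency")
                lower = bins[freqs[i - 1]]
                # Demote the tasks with the smallest predicted loss first.
                while len(cur) > capacity[f]:
                    lower.append(cur.pop(0))       # first entry = smallest loss
            for _, task in cur:
                dispatch(task, f)                  # queue for a CPU at f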
With reference now to the next figure, the process begins in this illustrative embodiment by performing load balancing (step 600). The load balancing process is described in more detail below.
Returning again to step 606, if it is determined that a benefit would be realized by executing the task at a higher frequency, one or more tasks of the target frequency (ftarget) task bin are evaluated (step 608). An evaluation is made to determine if any task in the ftarget task bin would suffer less performance loss by executing at the currently selected frequency than the current task evaluated for the selected frequency (step 610). If a task is identified in the ftarget task bin that would incur a lesser performance penalty by executing the task at the selected frequency, the task in the selected frequency task bin is swapped with the task in the ftarget task bin (step 612), and the balancing routine then proceeds to evaluate whether additional tasks remain in the currently selected frequency task bin according to step 614.
If it is determined at step 610 that no task in the ftarget task bin would incur a lesser performance loss by execution at the currently selected frequency than the task of the currently selected frequency being evaluated, the balancing routine proceeds to determine if additional tasks in the currently selected frequency task bin remain for evaluation according to step 614.
If it is determined that an additional task remains in the currently selected frequency task bin at step 614, the balancing routine accesses the next task in the currently selected task bin at step 615 and continues, evaluating the next task in the currently selected frequency task bin according to step 604. If it is determined that no additional tasks remain in the currently selected frequency task bin at step 614, the balancing routine proceeds to determine whether additional task bins remain for evaluation (step 616). Preferably, task bins are evaluated from the slowest frequency to the fastest. In the event that another task bin remains for evaluation, the balancing routine proceeds to select the next higher frequency (step 618), and processing returns to step 603 to select the task bin assigned to the currently selected frequency. If it is determined that no additional task bins remain for evaluation at step 616, the process then determines whether the target frequency ftarget is equal to fup, where fup is the next higher frequency above ftarget, or ftarget itself when ftarget is f4, the greatest available frequency (step 620). If the target frequency ftarget is not equal to fup, ftarget is set to fup (step 622), and the process returns to step 602. This causes the process to reiterate through all of the steps described above with ftarget set to fup. When the process reaches step 620 with ftarget equal to fup, the process terminates because there are no remaining frequencies in F to consider.
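The re-balancing pass described above might be sketched as follows. It is a loose illustration in which bins hold tasks directly, loss_fn(task, f) returns the predicted performance loss of running the task at frequency f, and the iteration order over target frequencies is an assumption.

    # Sketch of the re-balancing pass: for each task that would benefit from
    # running at the higher target frequency, look for a task in the target
    # bin that would lose less by running at the current (slower) frequency,
    # and swap the two tasks between bins.
    def rebalance(bins, freqs, loss_fn):
        for j in range(1, len(freqs)):             # each target frequency
            f_target = freqs[j]
            for i in range(j):                     # slower bins, slowest first
                f_cur = freqs[i]
                for a, task in enumerate(bins[f_cur]):
                    cur_loss = loss_fn(task, f_cur)
                    if cur_loss <= 0.0:
                        continue                   # no benefit from a speedup
                    for b, other in enumerate(bins[f_target]):
                        if loss_fn(other, f_cur) < cur_loss:
                            bins[f_cur][a], bins[f_target][b] = other, task
                            break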
In this manner, the present invention provides a method, apparatus, and computer instructions for identifying ideal processor frequencies for execution of an application phase in a multi-processor system. In these illustrative embodiments, the processes and mechanisms described minimize performance penalties, reduce power consumption, and allow the system to respond to changes in memory subsystem demands.
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.
The description of the present invention has been presented for purposes of illustration and description and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. Although the illustrative embodiment shows the mechanism of the present invention responding to a fixed set of frequencies, the mechanism may also be applied to varying frequencies, such as frequencies changed by external agents or sources. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.