The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), and DVDs (digital versatile discs or digital video discs), or other media, now known or later developed, that are capable of storing computer-readable data.
Note that the present invention can generally be used to simulate the performance of any type of computer system and is not meant to be limited to the exemplary computer system illustrated in the figures.
Defining the Statistical Distributions
A modern computer memory system can service multiple memory requests concurrently. Given enough processor and system resources (such as the instruction scheduling window or the load queue size), coupled with some inherent parallelism in the workload, some of the actual memory references can be overlapped, thereby reducing the effect of memory-system latency on system performance. In particular, studies show that cache misses become overlapped due to their burstiness in an out-of-order processor. This overlap can take a variety of forms: some cache misses may be overlapped almost entirely, some may overlap only barely, and any number of cache misses may be overlapped to some extent. Hence, to fully characterize the overlap between cache misses, we need to take into account not only the mean value of the overlap, but also its variation.
In one embodiment of the present invention, a memory-system model comprises two sets of statistical distributions associated with each of the following three memory reference types: loads, instruction fetches, and stores. The first set of three statistical distributions, designated αld, αif, and αst (i.e., the α distributions), characterizes the distances between consecutive cache misses of each respective type. Note that these distributions can be used to characterize the bursty miss behavior described above. Also note that smaller distances between consecutive misses make them easier to overlap. In one embodiment of the present invention, the cache misses of interest are specifically L2 cache misses.
In one embodiment of the present invention, we express distances in units of instructions or memory references (hits or misses) instead of in wall clock time. This is convenient because wall clock time typically includes processor stall time, which depends on memory-interconnect latencies. Consequently, if wall clock time were used, the distances would have to be corrected for any constituent stall time. In contrast, because no instructions or memory references are issued during stall periods, using instructions or memory references to measure distances obviates the need for such correction. On the other hand, memory-interconnect latencies and system response time are measured in wall clock time or processor clock cycles.
Note that the α distributions are typically dependent on memory-system configurations. For example, changes to the L2 cache miss rates (e.g., via sharing or the number of threads in the system) can clearly change these distributions. Because the α distributions for the abstraction model are obtained from a specific memory-system configuration (which is described in detail below), the actual α distributions (α′ distributions) used to simulate each new memory-system configuration are obtained from the α distributions either by rescaling the α distributions to match the total miss rate of the new memory-system configuration (as is detailed below) or by using a cache simulation.
The second set of three statistical distributions, designated τld, τif, and τst (i.e., the τ distributions), characterizes the distance between a cache miss of each respective type and the beginning of a stall period caused by that miss. We define a processor stall as the state in which the functional units are completely idle and no further instructions of any type can be issued or retired until the miss returns. Hence, the τ distributions summarize the amount of time that the processor is able to execute past a miss. Larger τ values facilitate looking farther ahead for miss-overlapping opportunities. We examine each abstraction in more detail below.
The first abstraction τld summarizes the distance between the point when a load miss occurs and the point when the result associated with the load is required to avoid stalling.
A “demand instruction miss” stalls the processor immediately. In the case of a processor that prefetches instructions, the abstraction τif summarizes the distance between the prefetch and the would-be demand miss. In particular, this shows how the proposed framework handles prefetches.
Note that a store miss typically stalls a processor only if it causes the store buffer to become full. Hence, the abstraction τst summarizes the distance between a store miss and the time when the store buffer fills up. Note that τst may be slightly sensitive to the memory-system interconnect configuration, depending on the specific mechanism used to handle stores and the specific memory-consistency model in use.
The six distributions described above provide a basis for understanding cache miss behavior of the processor.
Additionally, because the distances are measured in the number of instructions or memory references while memory latencies and system response time are measured in processor clock cycles, it is necessary to provide a conversion between the two measurement metrics. One embodiment of the present invention provides a conversion factor sinf for the model, wherein sinf represents the rate of execution in cycles per instruction (CPI) or cycles per reference. More specifically, sinf is obtained for an infinite (L2) cache when the processor is not stalled. Note that infinite cache CPI is a standard parameter used in many system models (see Sorin, D., Pai, V., Adve, S., Vernon, M., and Wood, D. 1998, “Analytic Evaluation of Shared-Memory Systems with LP Processors,” In Proceedings of the 25th International Symposium on Computer Architecture, 380-391).
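Because this conversion recurs throughout the simulation, the following one-line illustrative helper (in Python, with an assumed sinf value) sketches it; it simply multiplies a distance measured in references by sinf to obtain wall clock cycles.

```python
# Hypothetical sketch of the reference-to-cycles conversion described above.
# S_INF is the infinite-cache execution rate in cycles per memory reference;
# the value used here is illustrative only, not taken from the description.
S_INF = 1.8  # assumed cycles per reference

def refs_to_cycles(distance_in_refs: float, s_inf: float = S_INF) -> float:
    """Convert a distance measured in memory references to wall clock cycles."""
    return distance_in_refs * s_inf

# Example: a tau sample of 12 references corresponds to 12 * 1.8 = 21.6 cycles,
# which can be compared directly against a miss latency measured in cycles.
print(refs_to_cycles(12))
```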
Obtaining the Statistical Distributions
The abstraction components α and τ for the high-level model are obtained empirically by performing a trace-driven simulation on a full-scale computer system model. More specifically, the process operates as follows.
The system starts by receiving a cycle-accurate simulator for a processor endowed with a generic memory (step 202). In this model, we assume that the L1 cache configuration is fixed and a cache miss refers to an L2 cache miss unless noted otherwise. We also assume that the L2 cache latency is fixed.
Next, the system receives a workload which comprises a set of traces, wherein each trace comprises a sequence of memory references (step 204). Note that a given workload can include millions of memory references. In one embodiment of the present invention, the workload is a benchmark used to evaluate computer system performance.
The system then applies the workload to the computer system model to simulate the actual execution of the processor that is being modeled (step 206). Specifically, the system performs a cycle-accurate simulation of the workload executing on the cycle-accurate simulator, which generates trace records corresponding to different memory-reference-related events. In particular, the cycle-accurate simulation generates miss traces for each type of cache miss. For example, a miss trace for a load can include all of the following: the memory reference type (i.e., load), the cache/memory level that supplied the data (e.g., DRAM), the start and finish times of the load miss, and the duration of a stall period (if the miss causes a stall).
Note that because the traces record memory-reference-related events as opposed to other instructions, one embodiment of the present invention measures distances in units of memory references. Alternatively, a trace can maintain an instruction count for each reference it records.
Next, the system collects a set of sample values from the traces for each type of memory-reference-related events (step 208). Specifically, to record sample values for the α distributions, the system records the number of memory references from the trace between consecutive cache misses of the corresponding type. For example, assuming that references are sorted according to their start times, if references 1,005 and 1,017 are consecutive load misses, the system adds 1,017−1,005=12 (memory references) to the αld samples. Similarly, a τ sample value is obtained by recording the number of memory references that fall between the start of a cache miss and the beginning of a stall period caused by the cache miss.
Note that during the process of sample collection, if coalesced memory references directed to the same cacheline generate a cluster of miss traces, only the earliest miss trace in the cluster should be collected.
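For purposes of illustration only, the following Python sketch shows one way the sample collection of step 208 could be implemented. The trace-record format and field names are hypothetical and are not prescribed by the description above, and the coalescing filter shown only drops back-to-back misses to the same cacheline, a simplification of the clustering described above.

```python
# Illustrative sketch of alpha/tau sample collection (step 208).
# The MissRecord fields are hypothetical; trace records are assumed to be
# sorted by their start times, as in the example above.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MissRecord:
    ref_index: int                  # position of the reference in the trace
    ref_type: str                   # 'load', 'ifetch', or 'store'
    cacheline: int                  # cacheline address, for the coalescing filter
    stall_ref_index: Optional[int]  # reference index where a stall began, if any

def collect_samples(trace: List[MissRecord], ref_type: str):
    alpha_samples, tau_samples = [], []
    last_miss_index = last_cacheline = None
    for rec in trace:
        if rec.ref_type != ref_type:
            continue
        # Keep only the earliest miss of a coalesced cluster (simplified:
        # drops consecutive misses to the same cacheline).
        if rec.cacheline == last_cacheline:
            continue
        if last_miss_index is not None:
            # alpha sample: distance (in references) between consecutive misses.
            alpha_samples.append(rec.ref_index - last_miss_index)
        if rec.stall_ref_index is not None:
            # tau sample: distance from the miss to the start of its stall.
            tau_samples.append(rec.stall_ref_index - rec.ref_index)
        last_miss_index, last_cacheline = rec.ref_index, rec.cacheline
    return alpha_samples, tau_samples
```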
The system next constructs a statistical distribution for each type of memory-reference-related event based on the collected sample values (step 210). One embodiment of the present invention constructs a percentile distribution from the collected sample values based on their magnitude (in number of memory references). Hence, each sample value is ranked to a percentile value in (0%, 100%), which can be referred to as the frequency of that sample value. Note that multiple occurrences of the same sample value are ranked separately. This facilitates sampling from the percentile distribution during modeling. We describe below how to use these statistical distributions to model different memory-system configurations.
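A minimal illustrative sketch of this percentile-distribution construction and of sampling from it follows (in Python); sampling is performed by drawing a uniform percentile and reading off the correspondingly ranked sample, i.e., inverse-CDF sampling over the empirical distribution.

```python
import random

def make_percentile_distribution(samples):
    """Rank the collected sample values (step 210); ties are ranked
    separately because each occurrence keeps its own position."""
    return sorted(samples)

def sample_from_distribution(dist):
    """Draw a value by picking a uniform percentile in (0, 1) and returning
    the correspondingly ranked sample (inverse-CDF sampling)."""
    u = random.random()
    idx = min(int(u * len(dist)), len(dist) - 1)
    return dist[idx]

# Usage: alpha_ld = make_percentile_distribution(alpha_samples)
#        m = sample_from_distribution(alpha_ld)
```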
Illustrative Description of the Simulation Process
We now describe how to simulate the execution of a workload on a given memory-system design by using the abstractions of α′, τ, and sinf.
For this memory-system configuration, we assume that we know the cache miss rates per memory reference for each memory reference type. We also assume that we know various hardware latencies (e.g., L3 latency, main memory latency, etc.) which describe the memory-system interconnect, wherein the latencies are measured in wall clock time units (e.g., in processor cycles). Typically, these latencies are sums of hardware latencies and queuing delays. During the simulation, we maintain both a memory reference time tr and a wall clock time tw, as shown in FIG. 3.
We generate load misses for this illustration. Specifically, L2 cache misses are generated using the interarrival distributions α′. Note that if a first L2 load miss 302 takes place at time (tr0; tw0), the next L2 load miss 304 occurs at time tr=tr0+m (see FIG. 3), wherein m is a value sampled from the α′ld distribution.
Because we know the various cache miss rates, each L2 miss can be probabilistically chosen to hit in a particular memory-system component (e.g., an L3 hit, a memory hit, a remote L2 hit, etc.), thereby allowing the latency l corresponding to that L2 cache miss to be determined.
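As an illustrative sketch, this probabilistic choice of the servicing component might look as follows; the component names, hit probabilities, and latencies are assumptions for illustration only and are not taken from the description above.

```python
import random

# Assumed hit probabilities and latencies (in cycles) for where an L2 miss
# is serviced; these numbers are illustrative placeholders.
COMPONENTS = [("L3", 0.60, 40), ("remote L2", 0.15, 120), ("DRAM", 0.25, 400)]

def choose_miss_latency():
    """Probabilistically pick the component that services an L2 miss and
    return its name and latency l (in cycles)."""
    u, cumulative = random.random(), 0.0
    for name, prob, latency in COMPONENTS:
        cumulative += prob
        if u < cumulative:
            return name, latency
    # Numerical safety fallback in case of floating-point rounding.
    name, _, latency = COMPONENTS[-1]
    return name, latency
```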
Next, we determine the stall for the first load miss 302 using the τld distribution. For example, load miss 302 issued at time (tr0; tw0) causes a stall 306 at tr=tr0+k (see FIG. 3), wherein k is a value sampled from the τld distribution. The stall actually occurs only if the miss is still outstanding at that point, that is, if k×sinf is less than the latency l; in that case, the processor remains stalled until the miss returns at wall clock time tw0+l.
Referring to FIG. 3, if m&lt;k, the second load miss 304 is issued before stall 306 begins, so servicing the second miss overlaps with the stall; if m&gt;k, the stall begins before the second miss would be issued, and because no references are issued during a stall, the second miss is not issued until after the stall ends.
To determine the appropriate action when m=k, we note that a stall period corresponding to k references in the τld distribution effectively starts between the kth and the (k+1)th reference after the original miss. That is, the kth reference occurs before the stall begins. Hence, when m=k, the second miss is issued prior to the stall, and the stall time is overlapped with servicing this later miss. At the end of the stall period, tr remains tr0+k because no references are issued during the stall period. This observation also applies to instruction fetches and store misses, which are modeled analogously.
Note that during a stall period, all outstanding misses of any reference type are serviced concurrently. In particular, the miss processes corresponding to different reference types are now dependent when observed in time tw. It would be apparent to one of ordinary skill in the art that larger interconnect latencies impact simulation estimates by increasing the duration of stall periods. Furthermore, larger L2 cache miss rates pack more misses in front of a stall, which potentially allows for greater miss overlap while causing more frequent stalls.
We now describe how to obtain the α′ distributions from the abstraction distributions α. Note that the mean values of the α distributions are the reciprocals of the corresponding known miss rates. For a given new memory-system configuration, we use rescaled versions of α, so that the implied miss rates corresponding to the rescaled distributions α′ match the miss rates for the modeled interconnect. Again using loads as an example, the corrected distribution α′ld can be obtained by applying a stretching factor of mL2ld/m′L2ld to αld, where m′L2ld and mL2ld are the total L2 load miss rates for the new and the generic memory-system configurations, respectively.
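A minimal sketch of this linear rescaling follows, assuming miss rates expressed per memory reference; because the mean of α is the reciprocal of the miss rate, stretching the samples by mL2ld/m′L2ld makes the implied miss rate of α′ equal to m′L2ld.

```python
def rescale_alpha_linear(alpha_samples, miss_rate_generic, miss_rate_new):
    """Stretch alpha samples so that the implied miss rate (the reciprocal
    of the mean inter-miss distance) matches the new configuration's rate.
    stretch = mL2ld / m'L2ld: new mean = (1/m) * (m/m') = 1/m'."""
    stretch = miss_rate_generic / miss_rate_new
    return [x * stretch for x in alpha_samples]
```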
In another embodiment of the present invention, an exponential rescaling technique can be used to obtain α′ from α. Specifically, given samples x1, . . . , xn from αld, we numerically find an exponent ρ such that the transformed samples imply the target miss rate, i.e., such that

(1/n)·Σi=1..n [(1+xi)ρ−1] = 1/m′L2ld,

and then use (1+x1)ρ−1, . . . , (1+xn)ρ−1 as the new samples for α′. This technique may be more intuitive because the logarithmic scale for miss rates and interarrival distances is often considered natural (see Gluhovsky, I. and O'Krafka, B. W., "Comprehensive Multiprocessor Cache Miss Rate Generation Using Multivariate Models," ACM Transactions on Computer Systems 23(2), 111-145, 2005).
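The following sketch implements this exponential rescaling under the condition stated above, namely that ρ is chosen so that the mean of the transformed samples equals the target mean distance 1/m′L2ld; because the transformed mean is increasing in ρ, simple bisection suffices. The bracketing bounds are illustrative and are assumed to contain the root.

```python
def rescale_alpha_exponential(alpha_samples, target_miss_rate,
                              lo=0.01, hi=10.0, tol=1e-9):
    """Find rho by bisection so that the mean of the transformed samples
    (1 + x_i)**rho - 1 equals the target mean distance 1 / m'."""
    target_mean = 1.0 / target_miss_rate

    def transformed_mean(rho):
        n = len(alpha_samples)
        return sum((1.0 + x) ** rho - 1.0 for x in alpha_samples) / n

    # transformed_mean is increasing in rho (each x_i >= 0), so plain
    # bisection converges; [lo, hi] is assumed to bracket the solution.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if transformed_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    return [(1.0 + x) ** rho - 1.0 for x in alpha_samples]
```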
In yet another embodiment of the present invention, distributions α′ can be obtained through cache simulations of memory-system configurations of interest. Note that the same cache simulation can be used to obtain cache miss rates for these different memory-system configurations.
We now examine a particular detail of how we estimate the τ distributions. Note that some memory references do not cause stalls in the trace. This behavior is expected for many stores, and for memory references that coalesce to the same cacheline as other, stalling references. However, there are a number of scenarios in which a reference does not cause a stall in the trace, but would cause a stall if we could observe execution long enough after the reference had been issued, without interruptions from other references.
For example, suppose that load miss 302 in FIG. 3 returns before the processor has issued k references past it, for instance because an intervening stall caused by another miss halts the issuing of references while load miss 302 is outstanding. In that case, no stall is recorded for load miss 302 in the trace, even though a stall would have occurred had the miss remained outstanding longer. Treating such references as if they could never stall would bias the estimated τ distribution toward smaller values.
We can correct for this bias by using the Kaplan-Meier technique for censored data (see Kaplan, E. L. and Meier, P., "Nonparametric Estimation from Incomplete Observations," Journal of the American Statistical Association, 53, 457-481, 1958). More specifically, for each reference we record the τ time if it is observed. If it is not observed, we record the number of references issued while the miss is outstanding and annotate it to signify that the τ time is at least as large as the recorded number. This data provides standard input to the Kaplan-Meier technique.
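A self-contained sketch of the product-limit (Kaplan-Meier) estimate for these censored τ observations follows; the input format, pairs of (distance, observed-flag), mirrors the recording scheme just described.

```python
def kaplan_meier(observations):
    """Product-limit estimate of the tau survival function.

    `observations` is a list of (value, observed) pairs: `observed` is True
    when the stall (the tau time) was actually seen, and False when the value
    is only a lower bound (the miss returned first, censoring the sample).
    Returns the estimated P(tau > t) at each distinct event time t."""
    observations = sorted(observations)
    n_at_risk = len(observations)
    survival, curve = 1.0, []
    i = 0
    while i < len(observations):
        t = observations[i][0]
        deaths = at_this_t = 0
        while i < len(observations) and observations[i][0] == t:
            at_this_t += 1
            if observations[i][1]:
                deaths += 1        # an actually observed stall at distance t
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= at_this_t     # events and censored samples leave the risk set
    return curve
```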
Process of Simulating the Given Memory-System Configuration
The system starts by receiving a memory reference during execution of a given workload (step 402). The system then determines if the memory reference generates a cache miss (step 404). If not, the memory reference returns to the processor, and the system returns to step 402 to process the next memory reference. Otherwise, if the memory reference generates a cache miss, the system records both the memory reference time and the wall clock time when the cache miss occurs (step 406).
Next, the system computes the latency l (in wall clock time) of the cache miss based on the cache miss rates (step 408). Specifically, computing the latency involves probabilistically choosing a particular component in the memory system which is ultimately accessed by the cache miss based on the cache miss rates.
The system then determines a stall time associated with the cache miss (step 410). Specifically, the system determines the stall time (k) by sampling the corresponding τ distribution associated with the memory reference type. In one embodiment of the present invention, sampling the τ distribution involves randomly selecting a number from a percentile ranked τ distribution, wherein the percentile distribution was generated using the method described above.
The system then determines if the stall actually occurs by comparing wall clock times k×sinf (wherein sinf is the infinite cache CPI) and l (step 412). If not, the memory reference returns before the stall, and the system returns to step 402 to process the next memory reference.
Otherwise, if the stall due to the cache miss indeed occurs, the system fixes the memory reference time and wall clock time for the beginning and the end of the stall (step 414).
Note that based on the above simulation process, we know the exact time of the current cache miss in memory reference time. We can then determine the next memory reference that would generate a next cache miss. This is achieved by sampling from a corresponding α′ distribution in a manner similar to sampling the τ distribution.
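Putting the pieces together, the following simplified sketch walks through steps 402-414 for load misses. It assumes that the helper sketches defined earlier (sample_from_distribution and choose_miss_latency) are in scope, and, for brevity, it models only the case in which the next miss arrives after the current stall ends (m greater than k); handling the overlapped case would require tracking all outstanding misses, as described above.

```python
def simulate_load_misses(alpha_dist, tau_dist, s_inf, n_misses=100_000):
    """Simplified sketch of the simulation loop (steps 402-414) for load
    misses only. alpha_dist and tau_dist are percentile-ranked sample lists
    as built above; all parameter values are illustrative."""
    t_r = 0.0  # memory reference time (in references)
    t_w = 0.0  # wall clock time (in cycles)
    for _ in range(n_misses):
        m = sample_from_distribution(alpha_dist)  # distance to the next miss
        t_r += m
        t_w += m * s_inf                          # non-stalled execution
        _, latency = choose_miss_latency()        # step 408: pick latency l
        k = sample_from_distribution(tau_dist)    # step 410: stall offset
        if k * s_inf < latency:                   # step 412: does it stall?
            t_w += latency - k * s_inf            # step 414: stall duration;
                                                  # t_r is frozen during stalls
    return t_r, t_w  # e.g., load misses per cycle = n_misses / t_w
```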
We evaluate the abstraction model by applying it to a 2.8 GHz AMD Opteron™ processor running TPCC and SPECJBB workloads. In doing so, we show that the abstraction model remains latency invariant for both workloads. More specifically, we perform a cycle-accurate simulation of the Opteron processor endowed with a 1 MB L2 cache and main memory (DRAM), wherein the latency of the main memory is varied. Four DRAM latency levels are considered: 1 ns, 11 ns, 30 ns, and 190 ns. The corresponding average load-to-use system latencies are 58 ns, 68 ns, 87 ns, and 247 ns, respectively.
Note that the columns in the middle of Table 1 and Table 2 represent the rate of execution sinf for the four different memory latencies for TPCC and SPECJBB respectively in cycles per reference. We conclude that the variations of sinf are negligible.
Note that a logarithmic horizontal scale is used for the α distribution graphs. Also note that on each graph, the CDFs corresponding to the four different memory latencies are overlaid. In the case of τst, the graphs do not reach the ordinate of one because a nonnegligible percentage of stores never causes a stall. It is easily seen that latency changes cause indistinguishable differences in five of the six TPCC distributions and in all six SPECJBB distributions.
The TPCC distribution τst is the exception: its CDFs for the four latencies diverge in the upper tail. A store miss can only be observed to stall while it remains outstanding, so a trace collected with a given system latency cannot contain τst values that, converted to wall clock time via sinf, exceed that latency. The curves therefore begin to part as we get close to the smallest system latency considered (58 ns). Hence, we reach an agreement in the common region of observation. A simple remedy for this problem is to use a τst that corresponds to a large latency (e.g., 247 ns) in the abstraction. We do not find this problem in the case of SPECJBB, where a very small percentage of stores causes stalls after 58 ns.
We compare the model results with results given by the cycle-accurate simulation. Let r_ld^mod(l, l′) denote the number of L2 load misses issued per second in a system with memory latency l′, as computed by the model when using the abstraction that corresponds to memory latency l. That is, the abstraction is computed using a trace from simulating a system with latency l, and is then used to model a system with a possibly different latency l′. Furthermore, let r_ld^sim(l) denote the corresponding quantity given by the cycle-accurate simulation of a system with latency l. First, we use the same latency l for both the abstraction and the modeled system.
Table 1 summarizes relative errors of the simulation results from the abstraction model in comparison to simulation results from a cycle-accurate simulation from executing TPCC workload in accordance with an embodiment of the present invention.
Table 2 summarizes relative errors of the simulation results from the abstraction model in comparison to simulation results from a cycle-accurate simulation from executing SPECJBB workload in accordance with an embodiment of the present invention.
The load column in the left half of Table 1 presents the TPCC ratios r_ld^mod(l, l)/r_ld^sim(l) for the four latencies under consideration. The instruction and store column entries are defined analogously.
The left half of Table 2 contains the corresponding numbers for SPECJBB. We observe that the errors range from 0% to 3%, which indicates that the abstraction captures most of the important information about the processor as it impacts system performance.
Next, we investigate the effect of using a fixed abstraction to model systems with different latencies. This ability is important because our goal is to model a variety of memory-system configurations with a single processor abstraction. Because latency invariance of the abstraction primitives has already been shown, we do not expect any notable changes in the model accuracy. The right halves of Tables 1 and 2 present the ratios r_ld^mod(247, l′)/r_ld^sim(l′) for the two workloads, respectively. That is, we use the same abstraction obtained for the average memory latency l=247 ns (corresponding to setting the DRAM latency to 190 ns) to model systems with the other three latencies. We observe similarly small errors.
Finally, we vary the size of the L2 cache between 256 KB and 2 MB to show that the abstraction is insensitive to changes in the cache configuration.
Note that the proposed abstraction is also insensitive to changes to other types of cache configuration, which include, but are not limited to: the number of cache levels, cache parameters (e.g., size, associativity, sharing), cache-coherence protocol used, NUMA (nonuniform memory access) interconnect used, and directory-based lookup.
The present invention provides a technique for modeling computer system performance by using a generic high-level system model. In comparison to the existing modeling techniques, the present invention provides a number of advantages.
First, the model abstraction is portable and can therefore be used in a variety of system modeling contexts rather than being hardwired into a specific modeling methodology.
Second, the present invention abstracts the processor activity that is relevant for performance modeling. In particular, the model probabilistically captures cache miss overlaps for different types of cache misses, processor stalls, and miss burstiness. Note that we do not need to make the approximations or questionable invariance assumptions that are typical of existing models. Instead, we find invariances that are inherent to system behavior, which are intuitive and provide a compact description of the interaction between the processor and the memory subsystem. At the same time, these invariances can be used to provide extremely accurate estimates of system performance.
Third, the present invention permits straightforward extensions to account for advanced architectural features, which can include: instruction prefetching, data prefetching, and speculative activity during stall periods, referred to as runahead execution. For example, we modeled instruction prefetching in an Opteron processor within the same framework. Furthermore, the model permits efficient assessment of the benefits of these architectural features. More specifically, the model provides insights into the way the architectural features improve performance by generating a process that is typical of the new execution pattern. In the instruction prefetch example, it would be straightforward to determine stochastically how many other misses can be overlapped with a prefetched instruction miss, which would otherwise stall the processor immediately.
Finally, the same abstraction is suitable for modeling any computer system configuration, including: multiprocessor systems; memory subsystems with different numbers of cache-hierarchy levels; different cache configurations at each level (e.g., cache size, cache associativity, cache sharing); and various cache-coherence protocols, nonuniform memory access (NUMA) interconnects, and directory-based lookups. Moreover, the abstraction primitives can be obtained by parsing a single trace obtained from a single-core simulation.
The foregoing descriptions of embodiments of the present invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.
This application hereby claims priority under 35 U.S.C. §119 to U.S. Provisional Patent Application No. 60/789,963, filed on 5 Apr. 2006, entitled “Method for Evaluating Opteron Based System Designs,” by inventor Ilya Gluhovsky (Attorney Docket No. SUN06-0729-US-PSP).