Future computing machines will likely include a greater number of processor cores, which will result in multi-threaded programs becoming more commonplace. However, developers of multi-threaded programs and hardware will need to carefully consider how memory usage is impacted by thread interaction on such machines.
More particularly, the memory performance of a multi-threaded program depends primarily on three factors, namely the shared cache, shared data, and thread interleaving with respect to data access. In general, a shared cache is a dynamic space in which cache blocks are fetched and replaced in response to accesses by different threads. Performance depends on the access location, as well as on the access rate and the amount of data accessed. With respect to performance impacts that result from cache usage, threads positively interact when shared data is brought into the cache by one thread and subsequently used by one or more other threads. Threads negatively interfere with one another when non-shared accesses contend for shared cache resources.
Cache interleaving refers to each thread's accessing of the cache during its execution time. For example, threads with uniform interleaving uniformly alternate their cache usage, while threads that carry out asymmetrical tasks produce irregular (non-uniform) interleaving.
The performance of applications running on multicore processors is thus significantly affected by on-chip caches. However, exhaustive testing of various applications on such machines (e.g., machines with 32, 64, 128 or more cores) is not always feasible, as machines with fewer cores (e.g., 4-core or 8-core machines) are far more available in test environments than are the larger, expensive multicore machines that need to be used in an enterprise's commercial operations. An accurate cache locality model for multi-threaded applications that quantifies how concurrent threads interact with the memory hierarchy and how their data usage affects the efficiency and scalability of a system is thus very useful in evaluating software and hardware design decisions, and in improving scheduling at the application, operating system, virtual machine, and hardware levels.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which the execution traces corresponding to cache data accesses by a plurality of threads (e.g., operating in a test environment) are used to determine a model for predicting cache locality in a computing environment having a larger number of threads. In one aspect, the model is based upon a probability that the distance between one thread's access of the same block of data in the cache (reuse distance) will increase because of accesses by other threads, and upon another probability that the reuse distance will decrease because of intercept accesses by other threads to data blocks that are shared with the one thread. In one aspect, determining the probabilities is based upon estimating a first set of data blocks that are always shared, estimating a second set of data blocks that are possibly shared, and/or estimating a third set of data blocks that are private.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards analyzing cache behavior to predict locality, without requiring exhaustive simulation and/or simulation on machines having the same number of cores as a test machine. To this end, the locality of concurrent applications is modeled by the change in the distance of data reuses in the cache, in a manner that considers data sharing and/or non-uniform thread interleaving among threads.
It should be understood that any of the examples described herein are non-limiting examples. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing, caching and analysis in general.
As is known, before accessing an external memory 110, the processor 106 may access the cache 108 to more efficiently locate data. A thread execution tracing mechanism 112 records the cache data accesses as thread execution traces 114. As described below, via one or more models, an analysis mechanism 116 processes the thread traces 114 to provide output data 118 that quantifies how concurrent threads interact with the cache/memory hierarchy and how their data usage affects the efficiency and scalability of a system. The output data 118 may be used to statistically estimate how the programs will operate if scaled to a machine with more cores and thus more concurrent threads.
By way of background, cache reuse refers to how long (not necessarily in time but in number of intervening accesses) it takes for the same block (object) of cached data to be accessed by one or more threads. For a single thread, this is straightforward to measure by tracing the thread's accesses of each block. In one model, for each memory access, the reuse distance is the number of distinct data elements accessed between a given access and the previous access to the same data. For example, in the access sequence a b c b a, the reuse of b has distance one (only c is accessed in between), and the reuse of a has distance two (b and c are accessed in between).
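The following minimal Python sketch (the trace format and function name are illustrative, not from the original) computes this per-access reuse distance for a single thread:

```python
def reuse_distances(trace):
    """For each access, return the number of distinct data blocks
    accessed between it and the previous access to the same block
    (None for a block's first access, i.e., a cold access)."""
    last_seen = {}          # block -> index of its previous access
    distances = []
    for i, block in enumerate(trace):
        if block in last_seen:
            # Distinct blocks touched strictly between the two accesses.
            between = set(trace[last_seen[block] + 1 : i])
            distances.append(len(between))
        else:
            distances.append(None)  # cold (first) access
        last_seen[block] = i
    return distances

# Example: in trace a b c b a, the reuse of b has distance 1 (only c
# intervenes) and the reuse of a has distance 2 (b and c intervene).
print(reuse_distances(list("abcba")))  # [None, None, None, 1, 2]
```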
In general, in a multi-threaded system, threads interleave their cache accesses with other threads during each window of execution. Via tracing, a thread's reuse distance, as affected by accesses from other threads, can be used to build a histogram. However, in doing so, the technology described herein recognizes that the total amount of data accessed in each window of execution may be increased because other threads are also accessing the cache while the one thread is idle or concurrently executing, that is, the reuse distance increases for that thread. Further, the technology described herein also recognizes that the total amount of data accessed in each window of execution may be reduced as a result of shared data accesses, that is, to avoid double counting. In other words, the reuse distance of shared data for a thread may be shortened due to an intercept (an intercept refers to an access to a common data block by another thread).
As will be understood, the reuse distance for a thread is thus based on the probability that other threads will increase the reuse distance for data that is not shared, minus the probability that the other threads will decrease the reuse distance for data that is shared. Thus, part of the model is based on determining (e.g., by estimating) how many data blocks of execution traces are potentially shared (represented by w), how many are always shared (represented by x), and how many are always private (represented by p); this may be referred to as a WXP model herein.
To this end, the model described herein computes a set of composable, per-thread metrics in a single pass over a concurrent execution, which includes at least four threads (to solve four unknowns). For a system of p threads, the model approximates the $O(2^p)$ sharing relations between threads with $2p+2$ quantities. The per-thread data sharing model allows modeling concurrent executions that involve only a subset of application threads, or executions that include a larger number of similar application threads. In addition to computing miss rates for a shared cache, one extension permits modeling of coherence misses in the interleaved execution on a partitioned cache. There is thus described a composable per-thread data sharing model that is scalable and allows investigating concurrent executions with a smaller or larger number of similar threads. Also described is a model for irregular thread interleaving, which is integrated with the data sharing model.
More particularly, to account for the negative cache interference from other threads, both data sharing and thread interleaving are modeled. In addition, the system models the positive effect of sharing via a composable model that distinguishes patterns of data sharing. Based on a single pass over an interleaved execution trace, a set of per-thread parameters is computed that may be used to predict performance for various cache sizes, e.g., for sub-clusters of threads or for future environments with a larger number of similar threads.
As used herein, one locality metric is the reuse signature, which is the distribution of the reuse distances. More formally, a reuse signature is a pair comprising a series of (consecutive and non-overlapping) ranges and a series of frequency counts, $\langle R, C \rangle = \langle r_1 r_2 \ldots r_n, c_1 c_2 \ldots c_n \rangle$. For example, if $r_1$ is [0, 0] and $c_1$ is 2, two references have a reuse distance in the range [0, 0]. Often the frequency counts are weighted with the total number of data reuses and the locality represented as $\langle R, P \rangle = \langle r_1 r_2 \ldots r_n, p_1 p_2 \ldots p_n \rangle$. Then, if a memory reference is chosen at random, the probability that its reuse distance is in range $r_i$ is $p_i$. The reuse signature can be used to calculate the miss rate for a fully associative least-recently-used (LRU) cache of any size.
A memory reference hits if and only if the reuse distance is less than the cache size. Through a known probabilistic conversion, the miss rate for direct-mapped and set-associative cache may be estimated from the reuse signature.
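A minimal sketch of this conversion for a fully associative LRU cache follows (the bin format and the uniform-within-bin apportioning are illustrative assumptions):

```python
def lru_miss_rate(reuse_signature, cache_size):
    """Miss rate of a fully associative LRU cache: a reference misses
    iff its reuse distance is >= the cache size (in blocks).
    reuse_signature: list of ((lo, hi), prob) bins; (lo, hi) inclusive.
    Bins straddling the cache size are apportioned linearly."""
    miss = 0.0
    for (lo, hi), prob in reuse_signature:
        if lo >= cache_size:
            miss += prob                    # the whole bin misses
        elif hi >= cache_size:
            # Assume distances are uniform within the straddling bin.
            width = hi - lo + 1
            miss += prob * (hi - cache_size + 1) / width
    return miss

# Example: 50% of reuses in [0, 0], 30% in [1, 2], 20% in [8, 15].
sig = [((0, 0), 0.5), ((1, 2), 0.3), ((8, 15), 0.2)]
print(lru_miss_rate(sig, cache_size=4))     # 0.2: only the last bin misses
```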
One locality model described herein comprises five components for each thread of a concurrent program: the reuse signature of the thread, its reuse time map, footprints, the interleaving, and the data sharing with other threads. These components, generally represented in the accompanying figures, are described below.
By way of example, consider the data access trace for two threads, as generally represented in the accompanying figures.
Four quantities are used to compute the new reuse distance $R'_1$. The first is the original $R_1$ in the sequential trace of Thread 1. The second is the time window $T(R_1)$ in Thread 1, which is called the reuse time of $R_1$.
Another function, $M$, determines how accesses from the two threads interleave in the concurrent execution. Given this, $T_2 = M(T(R_1))$ is the relevant time window in Thread 2 in which Thread 2's accesses affect the new reuse distance $R'_1$. Also, the number of distinct data blocks accessed in $T_2$ is called the footprint of $T_2$, which is denoted $F(T_2)$. The new reuse distance is $R'_1 = R_1 + F(M(T(R_1)))$. In other words, the reuse distance is lengthened by the footprint of the time window in Thread 2 that coincides with the time of the reuse in Thread 1. This may be used to compute the overall locality of Thread 1 by applying this formula to its reuse distances and generating a new distribution.
Under the assumption of uniform interleaving, the length of the coinciding time window is computed as

$$T_2 = \frac{N_2}{N_1}\, T(R_1)$$

where $N_1$ and $N_2$ are the lengths of the traces of the two threads. If the reuse distances and time windows are represented by their lengths, the new reuse distance becomes

$$R'_1 = R_1 + F\!\left(\frac{N_2}{N_1}\, T(R_1)\right).$$
When there are k threads, the new reuse distance $R'$ for Thread $i$ is the original $R$ plus the footprints contributed by the other threads, as shown by the following equation:

$$R' = R + \sum_{p \neq i} F_p\!\left(\frac{N_p}{N_i}\, T_i(R)\right)$$

where $T_i$ is the reuse time map in Thread $i$, and $F_p$ is the footprint map of Thread $p$. The $T_i$ value computes the expected time window for a reuse distance, and $F_p$ computes the expected footprint for a time window. The equation uses a constant ratio and summation because of the assumption (for illustration purposes) that threads are uniformly interleaved and threads do not share data. A property of the previous model is that cache sharing can never improve the locality because a reuse distance is never shortened.
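A direct transcription of this formula as a sketch (the reuse-time and footprint maps are supplied as callables built by the profiling described below; all names are illustrative):

```python
def new_reuse_distance(r, i, T, F, N):
    """R' = R + sum over peers p of F_p((N_p / N_i) * T_i(R)),
    assuming uniform interleaving and no data sharing.
    T[i]: reuse-time map of thread i (reuse distance -> expected window)
    F[p]: footprint map of thread p (window length -> expected footprint)
    N[i]: trace length of thread i"""
    window = T[i](r)                     # expected reuse time of r in i
    return r + sum(F[p](window * N[p] / N[i]) for p in N if p != i)
```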
Numerous reuse distances, time windows, footprints, and their relations are represented collectively as statistical distributions and their mappings. The reuse signature $\langle R, P \rangle$ is a distribution, where $R$ is a series of bins representing consecutive ranges, $r_i = (d_i, d_{i+1})$, and $P$ is a series of probabilities $p_i$; the relation is that a $p_i$ portion of reuses have distances between $d_i$ and $d_{i+1}$. In statistical terms, a randomly selected reuse distance has probability $p_i$ of being between $d_i$ and $d_{i+1}$. From a distribution, statements such as the average reuse distance is X and the most probable reuse distance is Y may be made.
A distribution may be implemented as a histogram, which may be designed to use w-wide logarithmic ranges for its bins, where each consecutive power-of-two range is divided into w bins of equal size. For example, if M is the number of data blocks used by an execution, the reuse signature has a total of w log M entries. The histogram is thus logarithmic in size yet the precision can be tuned by controlling w, which is eight in one implementation.
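As a rough sketch of this binning (W = 8 sub-bins per power-of-two range; for small ranges whose width is below W, some sub-bins simply stay empty in this simplification):

```python
W = 8  # bins per power-of-two range (w == 8 in one implementation)

def bin_index(distance):
    """Map a reuse distance to a logarithmic histogram bin: each
    power-of-two range [2^k, 2^(k+1)) is split into W equal sub-bins,
    so a signature over M blocks needs only about W * log2(M) bins."""
    if distance < 1:
        return 0                              # distance 0 gets its own bin
    k = distance.bit_length() - 1             # k = floor(log2(distance))
    offset = (distance - (1 << k)) * W >> k   # which of the W sub-bins
    return 1 + k * W + offset
```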
As generally represented in the accompanying figures, the reuse time map is a matrix with bins of reuse distances in the first dimension and bins of reuse times in the second dimension.
Similarly, the footprint map is a matrix with bins of time-window sizes in the first dimension and bins of footprints in the second dimension. Each row shows the distribution of footprints for the same time-window size, and each column shows the distribution of time-window sizes for the same footprint. Note that a reuse window accesses the same data at both boundaries, but a time window can access different data. For a trace of length n, there are $O(n)$ reuse windows but $O(n^2)$ time windows. If a time window is randomly selected in an execution, it is most likely not a reuse window, so the reuse time map cannot represent the footprint map.
In the application model described herein, a concurrent execution of multiple threads is recorded in an interleaved trace. The reuse distance and its reuse time in each thread are measured by extracting from the interleaved trace the accesses by the thread. The footprint may be measured by random sampling of time windows and recording the relation between the window size and the volume of data accessed in the window.
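One way to realize this sampling, as a rough sketch (the window-selection policy is illustrative; a real profiler would aggregate the pairs into the footprint matrix described above):

```python
import random

def sample_footprints(trace, num_samples=1000):
    """Estimate the footprint map by sampling random time windows and
    recording (window length, number of distinct blocks accessed)."""
    n = len(trace)
    samples = []
    for _ in range(num_samples):
        start = random.randrange(n)
        end = random.randrange(start + 1, n + 1)
        samples.append((end - start, len(set(trace[start:end]))))
    return samples   # pairs to be binned into the footprint matrix
```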
With respect to thread interleaving, as generally represented in the accompanying figures, threads do not necessarily interleave uniformly; the relative rate at which threads execute may vary from one window to another.
The technology described herein recognizes that the interleaving only needs to be measured within the two ends of every reuse distance. As a result, during the simulation pass over the interleaved execution trace, the execution counts are recorded, along with the number of executed instructions for each thread, at the last access of each data element. At each memory reference, the memory reference counts between this and the previous access of the same data are computed; the interleaving relation between each thread and all other threads is stored. The total size is quadratic in the number of threads k, since each thread holds k−1 relations. Note that the quadratic cost may be avoided by using a global virtual time and computing the relative rate of execution in each virtual time window; however, this needs an additional map from the reference count of each thread to its virtual time range, and also needs to measure all windows instead of all reuse windows. An alternative is to use an exponential model that measures the interleaving probability for each thread subset.
The examples herein use the quadratic relations. For example, let $B$ be the number of bins in a time histogram. For each thread $t$, the interleaving with the other $k-1$ threads is represented by two $B \times (k-1)$ matrices: the probability matrix and the ratio matrix. In both matrices, each row is a reuse time bin. In the probability matrix, the element $(b_i, t_j)$ is the probability of a reuse window of size $b_i$ being concurrent with the execution of Thread $t_j$. When they are concurrent, the element $(b_i, t_j)$ in the ratio matrix gives the rate of execution of Thread $t_j$ relative to $t$ in that window.
One implementation denotes the two matrices as $Interleave_t^{prob}$ and $Interleave_t^{rate}$. An example algorithm (Algorithm 1) for computing locality with the interleaving model is as follows:
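A minimal Python sketch of such an algorithm, with illustrative names and the two subroutines described below supplied as callables, is:

```python
from itertools import combinations

def compute_locality(t, reuse_signature, peers,
                     get_concurrency_prob, get_reuse_dis):
    """Sketch of Algorithm 1: for each reuse-distance bin of thread t,
    enumerate every companion subset of peer threads (including the
    empty set, i.e., t running alone), compute the new reuse distance
    under that subset, and weigh it by the subset's probability."""
    new_signature = []
    for bin_range, bin_prob in reuse_signature:
        for size in range(len(peers) + 1):
            for companions in combinations(peers, size):
                # Probability that t runs concurrently with exactly
                # this companion subset (inclusion/exclusion inside).
                prob = get_concurrency_prob(t, bin_range, companions)
                # Effect of the companions on the reuse distance,
                # assuming uniform interleaving within this bin.
                new_dis = get_reuse_dis(t, bin_range, companions)
                new_signature.append((new_dis, bin_prob * prob))
    return new_signature  # re-binned into a histogram by the caller
```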
For each bin in the original reuse signature, the algorithm enumerates the possible companion subsets. For each companion set, the algorithm invokes the sharing or no-sharing data model as a subroutine to compute the new reuse distance. It then computes the probability prob that Thread t runs concurrently with (and only with) the companion subset, to thereby weigh the new reuse distance by prob. The subset enumeration includes the empty set, in which case Thread t runs by itself.
The subroutine GetConcurrencyProb uses a standard inclusion/exclusion method and is omitted herein for brevity. The algorithm calls the subroutine GetReuseDis to compute the effect of concurrent execution on ri. It assumes uniform interleaving and uses the formula described above. Note that the uniformity is used only for the same bin in the same companion thread set; it will not be uniform for the whole execution unless every column of the ratio matrix has the same value, and every number in the probability matrix equals one (1). In general, this model effectively accounts for a full range of possible interleaving.
Turning to aspects related to concurrent threads that share data in the cache, one model captures data sharing in a composable manner. As will be understood, this can be used to compute the miss rate of the interleaved concurrent execution when accessing a shared cache.
In one implementation, the per-thread data accesses are modeled by dividing them into three components. The first is always-shared data, which has the same size (denoted x) for all threads; this represents shared data that is always accessed in the given scenario. For example, global constants, the root node of a B-tree, or the first node of a shared (linked) list are always accessed whenever related data is needed.
A second type is potentially shared data, which each thread accesses from a common pool. The size of the shared pool, denoted w, is the same for all threads. The portion of data blocks in w accessed by thread i is $m_i$ $(0 \le m_i \le 1)$. This category represents shared data that is not necessarily accessed by all threads, such as the leaf nodes of shared tree structures.
A third component is the private data accessed by each thread, denoted as size $p_i$ for thread i. The size of the data accessed by thread i, or the data universe of i, is then the sum of the three components: $u_i = x + m_i w + p_i$.
For a system of k threads, the model approximates these $O(2^k)$ relations with $2k+2$ quantities, namely two numbers per thread, $p_i$ and $m_i$, and two numbers for the system, x and w. In addition, the impact of additional threads may be investigated by modeling their data sharing with these parameters $p_i$ and $m_i$.
The model can approximate asymmetrical sharing. If a sub-group of threads share more data with each other than they do with others, they will have higher $m_i$ values. The model does not constrain the size of private data in threads; each thread can have any size $p_i$. The model is reasonably precise in approximating the exponential number of data sharing relations with a linear-size model. A significant complication, even in this simplified model, is that none of the $2k+2$ numbers can be directly measured. For example, if a data block is accessed by all threads in a concurrent execution, the block could be part of x (always shared data), but it could also be part of w (potentially shared data) if it just happened to be accessed by all threads. Similarly, if a block is accessed by only one thread, it may belong to either w or p (private data).
The model may be built by running the interleaved execution once and recording the set of data blocks accessed by each thread. Then the amount of data sharing between any subset of threads is computed. The computation cost is $O(M \cdot 2^k)$ and the space cost is $O(M)$, where M is the number of data blocks. In one implementation, the execution trace is simulated only once; these measured sets are termed universes.
The average sizes of the shared universes between two threads, three threads, and four threads are computed; let the three numbers be $\bar{u}_2$, $\bar{u}_3$, and $\bar{u}_4$. Letting $\bar{m}$ denote the average of the $m_i$, these averages are approximately

$$\bar{u}_2 \approx x + \bar{m}^2 w$$

$$\bar{u}_3 \approx x + \bar{m}^3 w$$

$$\bar{u}_4 \approx x + \bar{m}^4 w$$

From the pairwise differences, the average

$$\bar{m} = \frac{\bar{u}_3 - \bar{u}_4}{\bar{u}_2 - \bar{u}_3}$$

is computed. With this average, x and w are solved using any two of the above (now linear) equations for $\bar{u}_2$, $\bar{u}_3$, and $\bar{u}_4$. Any two may be chosen because a solution to any two is a solution to all three.
A last step is to compute the approximation of $m_i$ and $p_i$: let $u_{i,j}$ denote the size of the universe shared by threads i and j, so that $u_{i,j} - x \approx m_i m_j w$. Then

$$\frac{m_i}{m_j} \approx \frac{u_{i,q} - x}{u_{j,q} - x}$$

for $q \neq i, j$. To improve precision, the ratio may be approximated by the average over all such q. Once there is an estimate of the ratio $m_1/m_j$ for all $j \neq 1$, the system computes

$$m_1 = \sqrt{\frac{m_1}{m_2} \cdot \frac{u_{1,2} - x}{w}}$$

and then each remaining $m_j$ from the estimated ratios. Also, $p_i = u_i - m_i w - x$, where $u_i$ is the size of the data accessed by thread i.
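A numeric sketch of this fitting procedure (function and argument names are illustrative; it assumes the universes have been measured as described, that there are at least four threads, and that the denominators are nonzero):

```python
from math import sqrt

def solve_wxp(u2, u3, u4, u_pair, u_total):
    """Fit the WXP parameters from measured universe sizes.
    u2, u3, u4: average shared-universe sizes over thread pairs,
                triples and quadruples
    u_pair[(i, j)]: measured shared universe of threads i < j
    u_total[i]: total data universe u_i of thread i
    Returns x, w, and per-thread lists m and p."""
    k = len(u_total)
    m_bar = (u3 - u4) / (u2 - u3)              # average sharing probability
    w = (u2 - u3) / (m_bar**2 * (1 - m_bar))   # from u2 - u3 = m^2 (1 - m) w
    x = u2 - m_bar**2 * w                      # back-substitute into u2

    def shared(i, j):
        return u_pair[(min(i, j), max(i, j))] - x   # ~ m_i * m_j * w

    # Ratio m_0 / m_j, averaged over all third threads q for precision.
    ratio = [1.0] + [
        sum(shared(0, q) / shared(j, q) for q in range(k) if q not in (0, j))
        / (k - 2)
        for j in range(1, k)
    ]
    m0 = sqrt(ratio[1] * shared(0, 1) / w)     # from m_0 * m_1 * w = shared(0,1)
    m = [m0 / ratio[j] for j in range(k)]
    p = [u_total[i] - m[i] * w - x for i in range(k)]
    return x, w, m, p
```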
As can be seen, the data sharing model requires at least four threads to solve a non-linear equation to obtain these hidden parameters, and as mentioned above, the model is referred to as WXP for its three components. While WXP approximates the cumulative sharing over an entire trace, note that it can be used as the average of active data sharing for any execution window. Further, the model enables composing thread behavior in an abstract but quantitative manner, e.g., a thread may be “cloned” by giving a new thread the same WXP, which means that the new thread accesses shared and private data in the same way as the original. In this way the behavior of hypothetical thread groups may be studied independent of their sizes.
The composable data sharing model may be used to compute the shared cache miss rate. A general task computes the effect of concurrent execution on each reuse distance of each thread. The overall miss rate can then be computed by iterating the solution on all reuse distances and all threads; note a reuse distance refers to a reuse-distance bin in the histogram, which represents the group of reuses whose distance falls within a range.
Consider first the simpler example generally represented in the accompanying figures.
The k-thread case is handled by the following algorithm (Algorithm 2), which computes the effect of concurrent execution on reuse distance r in Thread t when t is executed with k−1 other threads, with a shared cache and shared data (where Thread t is running with Threads $t_1, \ldots, t_{k-1}$):
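A minimal Python sketch of such a routine, reconstructed from the three-stage description that follows (all names are illustrative; wxp is a hypothetical helper exposing the WXP decomposition and the shared-union estimate described below):

```python
def get_share_reuse_dis(r, t, peers, reuse_time, footprint, rate, wxp):
    """Sketch of GetShareReuseDis (Algorithm 2): the effect of concurrent
    execution on reuse distance r of Thread t under a shared cache with
    shared data. reuse_time[i] and footprint[i] are per-thread maps,
    rate[p] is the instruction-rate ratio of peer p relative to t, and
    wxp.split / wxp.shared_union are hypothetical WXP helpers."""
    k = len(peers) + 1

    # Stage 1: footprints of the k-1 peers in the coinciding windows.
    window = reuse_time[t](r)                      # time window of r in t
    foot = {p: footprint[p](window * rate[p]) for p in peers}

    # Stage 2: reuse distance assuming no intercept.
    r_x, r_w, r_priv = wxp.split(t, r)             # decompose r per WXP
    parts = {p: wxp.split(p, foot[p]) for p in peers}
    data_shared = (wxp.shared_union([r_x] + [parts[p][0] for p in peers])
                   + wxp.shared_union([r_w] + [parts[p][1] for p in peers]))
    dis_no_intercept = (r_priv + sum(parts[p][2] for p in peers)
                        + data_shared)

    # Stage 3: effect of intercepts; k-1 intercepts divide the window
    # into k sections, and the intercept probability is taken as the
    # relative size of shared data in Thread t.
    prob_intercept = data_shared / dis_no_intercept
    dis_intercept = dis_no_intercept / k
    return ((1 - prob_intercept) * dis_no_intercept
            + prob_intercept * dis_intercept)
```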
The inputs to the model include the reuse-distance bin r, the reuse-time map of Thread t, the footprint map, the data sharing model (i.e., the WXPs) and the data size for all threads, and the ratio of instruction rates between the other threads and Thread t. The last parameter gives the relative size of execution windows among threads. Under uniform interleaving, the ratio for peer thread $t_i$ is

$$\frac{N_{t_i}}{N_t}$$

where $N_{t_i}$ and $N_t$ are the lengths of the traces of $t_i$ and t, respectively. In one solution, the ratio is supplied by the interleaving model described above.
Algorithm 2 includes nine statements that can be divided into three stages. Stage 1 computes the footprint for the k−1 peer threads, and includes three statements, namely finding the length of the time window of r in Thread t, finding the coinciding time windows in the peer threads, and finding the footprint in these windows.
Stage 2 computes the reuse distance assuming no intercept. It has four statements, including decomposing r based on the WXP model, decomposing the footprints $f_{t_i}$ based on the WXP model, computing the size of shared data $data_{shared}$, and adding it to the size of private data to obtain the result, $dis_{no\,intercept}$.
Note that one step in Stage 2 estimates the overlap in shared-data access. From the WXP model, the shared-data components in the reuse distance r and each footprint $f_{t_i}$ are separated. Consider the X components: $r^x$ from the reuse distance and $f_{t_i}^x$ from each footprint. The size under full data sharing is

$$s_{max} = \max(r^x, f_{t_1}^x, \ldots, f_{t_{k-1}}^x)$$
The union $s_{union}$ is bounded by two extreme cases, namely the size under full data sharing, $s_{max}$, where the union is as large as the largest component; and no data sharing, $s_{sum}$, where the union is the total size of all components.
An estimate of $s_{union}$ is made by taking the joint probability, that is, by taking

$$p_i = \frac{s_i}{U_x}$$

as the probability that an X block belongs to component i (of size $s_i$), and computing $s_{union}$ from the union of these probabilities using the inclusion-exclusion principle. Let $p_1, p_2, \ldots, p_k$ be the probabilities; treating the components as independent, the union $p_{union}$ is

$$p_{union} = 1 - \prod_{i=1}^{k} (1 - p_i)$$

Then $s_{union} = p_{union} U_x$, where $U_x$ is the size of the X data in the WXP model. Note that $U_x$ is cumulative sharing and not necessarily the active sharing, especially in small execution windows. Different methods of estimating the working set size may be used (note that because different threads may execute a very different number of instructions during a time window, it is unclear what a common working set means). The sum of all components is $s_{sum} = r^x + \sum_i f_{t_i}^x$.
The shared portion of the W data is computed in the same way. The total shared data, $data_{shared}$, is the sum of the two. The reuse distance without an intercept, $dis_{no\,intercept}$, is the sum of the private data accessed by all threads plus $data_{shared}$, as shown by the last statement in Stage 2 of Algorithm 2.
When predicting for a large number of threads, the exponential-cost inclusion/exclusion computation is approximated by first calculating the average probability and then computing as if all threads have this probability. The approximation may overestimate the amount of data sharing; however, the time cost is linear in the number of threads.
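Both the union estimate and its linear-cost approximation fit in a few lines; a sketch (names are illustrative):

```python
from math import prod

def shared_union_size(component_sizes, u_x, approximate=False):
    """Estimate s_union for one shared-data component class (e.g., X).
    p_i = size_i / U_x is the probability that a block belongs to
    component i; treating the components as independent, the union is
    1 - prod(1 - p_i)."""
    probs = [s / u_x for s in component_sizes]
    if approximate:
        # Linear-cost approximation for many threads: use the average
        # probability everywhere; may overestimate the amount of sharing.
        p_avg = sum(probs) / len(probs)
        p_union = 1 - (1 - p_avg) ** len(probs)
    else:
        p_union = 1 - prod(1 - p for p in probs)
    return p_union * u_x
```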
Stage 3 of the algorithm computes the effect of intercepts. The probability and the frequency of intercepts depend on the overlap among shared-data components. Two heuristics may be used, namely that the average probability of an intercept is the relative size of shared data in Thread t, and that the number of intercepts is the number of thread peers. The denominator of k when computing $dis_{intercept}$ arises because k−1 threads cause k−1 intercepts that divide a time window into k sections. Note that Algorithm 1 is used to compute the overall locality of the concurrent execution, but with GetShareReuseDis (described in Algorithm 2) replacing GetReuseDis.
The model may also be extended to compute locality for partitioned caches, thereby modeling coherence misses. If each thread accesses a separate partitioned cache, then the only way for another thread to impact its locality is by writing to a shared data item that results in a cache invalidation (ignoring the impact of write-to-read downgrades). This situation only impacts the cache miss rate computation when the second thread intercepts the reuse distance of the first thread (as in the intercept scenario described above).
Exemplary Operating Environment
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to the accompanying figures, an exemplary system for implementing various aspects described herein may include a general-purpose computing device in the form of a computer 510.
The computer 510 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 510 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 510. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532. A basic input/output system 533 (BIOS), containing the basic routines that help to transfer information between elements within computer 510, such as during start-up, is typically stored in ROM 531. RAM 532 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 520, such as an operating system, application programs, other program modules and program data.
The computer 510 may also include other removable/non-removable, volatile/nonvolatile computer storage media, such as, by way of example only, a hard disk drive that reads from or writes to non-removable, nonvolatile magnetic media, and drives that read from or write to removable, nonvolatile magnetic or optical disks.
The drives and their associated computer storage media, described above and illustrated in the accompanying figures, provide storage of computer-readable instructions, data structures, program modules and other data for the computer 510.
The computer 510 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 580. The remote computer 580 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510, although only a memory storage device 581 has been illustrated in the accompanying figures. The logical connections may include a local area network (LAN) 571 and a wide area network (WAN) 573, but may also include other networks.
When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet. The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560 or other appropriate mechanism. A wireless networking component 574, such as one comprising an interface and antenna, may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, remote application programs may reside on the memory storage device 581.
An auxiliary subsystem 599 (e.g., for auxiliary display of content) may be connected via the user interface 560 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 599 may be connected to the modem 572 and/or network interface 570 to allow communication between these systems while the main processing unit 520 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.