Embodiments of the present invention relate generally to pre-fetching objects for use with an object oriented program.
An example of an object oriented programming language is Java® from Sun Microsystems Incorporated. A Java virtual machine can give Java programs a software-based computer they can interact with. Because the Java virtual machine is not a real computer but exists in software, a Java program can run on any physical computing platform, such as Windows, Macintosh, Linux, Unix or any other system equipped with a Java virtual machine.
Object-oriented programming languages use generalized categories, called classes, that describe a group of more specific items called objects. Classes can define fields that are used by objects. Objects are specific instances of a class that can include values for the fields defined by the class.
A system running a virtual machine can include cache memory and main memory. Cache memory can be memory located on a computer's processor. A cache hit occurs when data to be read is stored in cache memory. A cache miss occurs when data to be read is not stored in cache memory.
Main memory is typically memory located outside a processor. Storing data used by a program in cache memory prior to the data being read can increase the speed of a system in some embodiments by avoiding reads from the main memory.
An object can reference another object. When an object references another object, a load can be performed to retrieve a field from an object or a group of objects. If the object cannot be located in the cache memory, the virtual machine can make an access to a computer's main memory to retrieve the object; however, this can negatively influence performance.
A hardware performance monitor can be a circuit used to oversee the performance behavior of a system. A hardware performance monitor in some embodiments can identify a potentially small subset of delinquent objects, which are objects that are frequently missed in a cache memory.
Identifying most of the delinquent objects and chains of delinquent objects can be beneficial to the efficiency of a virtual machine. A virtual machine that can be used in an embodiment is available from BEA Systems Incorporated of San Jose, Calif.
Objects can contain multiple fields. For example, if an object A identified a person, and object B identified another person, a field within object A might include the first person's name and a field within object B might include the second person's name. A load, which is an instruction to read data from memory for use by a program, can read a field stored by an object. For example, a load of the first person's name can be referenced as A.Name, where A identifies the object of the first person and Name identifies the field in object A storing the name of the first person. For the second person, a load referenced as B.Name can read the name field of the second object. A bit can be set in the header of object A and object B as the first instance of the load is performed to identify objects A and B as delinquent objects of a delinquent load.
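The field-load and header-marking behavior described above can be sketched as follows. This is an illustrative assumption, not a real VM interface: the class name `Person`, the method `loadName`, and a boolean field standing in for the header bit are all hypothetical.

```java
// Illustrative sketch only: a boolean field stands in for the bit that
// instrumentation would set in the object header on the first load.
public class Person {
    boolean delinquentBit;  // stands in for the mark in the object header
    String name;            // the Name field read by a load such as A.Name

    Person(String name) {
        this.name = name;
    }

    // Models a load of the Name field; instrumentation marks the object
    // as a delinquent object of a delinquent load when the load runs.
    String loadName() {
        delinquentBit = true;
        return name;
    }
}
```

A load referenced as A.Name then corresponds to calling `loadName()` on the first instance, which both returns the field value and sets the mark.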
Delinquent loads refer to loads whose target data addresses are frequently missed in the cache memory. A program run by a virtual machine can use many more objects than it contains loads. A hardware performance monitor can identify most of the delinquent loads but not most of the delinquent objects because there can be fewer loads than objects.
Cache size, processor speed or frequency of use of data can be some of the variables used to determine if an address should be identified by the hardware performance monitor as frequently missed in the cache. For example, in one embodiment if an address is missed, the address is identified as frequently missed and in other embodiments an address can result in a cache miss a percentage of the time before the address is identified as frequently missed. The percentage can be determined by the hardware performance monitor, in one embodiment.
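The miss-percentage criterion described above can be sketched as a simple counter; the class name `MissTracker` and its methods are hypothetical, and a real hardware performance monitor would implement this in circuitry, typically via sampling rather than exact counts.

```java
// Hypothetical sketch: an address is flagged as "frequently missed"
// once its observed miss rate reaches a monitor-chosen percentage.
public class MissTracker {
    long accesses;
    long misses;
    final double threshold; // e.g. 0.5 means 50% of accesses miss

    MissTracker(double threshold) {
        this.threshold = threshold;
    }

    // Records one access to the tracked address.
    void record(boolean hit) {
        accesses++;
        if (!hit) {
            misses++;
        }
    }

    // True when the miss rate meets or exceeds the threshold.
    boolean frequentlyMissed() {
        return accesses > 0 && (double) misses / accesses >= threshold;
    }
}
```

With a threshold of 1.0 this degenerates to the embodiment in which every access must miss; with a threshold near 0 it approaches the embodiment in which a single miss suffices.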
A hardware performance monitor can be used to capture instructions whose target addresses frequently miss in the cache memory. The capture may be performed during dynamic profile-guided optimization operations using a given hardware performance monitor, and more particularly using hardware of the monitor such as data event address registers (EARs). A hardware performance monitor that can be used is included with Itanium® processors available from Intel Corporation of Santa Clara, Calif.
Referring now to
As shown in
A hardware performance monitor can identify addresses that are accessed and not frequently present in the cache memory. The hardware performance monitor can collect the information regarding addresses frequently not stored in the faster memory, e.g., via sampling. A virtual machine can use this information obtained from the hardware performance monitor to generate instrumentation, which can be a set of instructions or code that is inserted into a program code. The instrumentation code may mark a header of an object to identify the object as a delinquent object of a delinquent load. A header is data in an object that identifies the object. In various embodiments, a user-defined analysis may be performed by the VM to find the chains of delinquent loads.
The instrumentation code inserted in the program code can thus mark the object that contains the field identified by the address that is absent from the cache memory. In one embodiment, delinquent objects may be marked in their object headers using at least one bit, although the scope of the present invention is not so limited. One bit may be used to specify a delinquent root. When using two bits, a first bit can identify the object as delinquent or not and the second bit can identify the object as a root or a child. Since such instrumentation can be performed on all instances, most of the delinquent object chains can be captured.
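The two-bit marking scheme described above can be sketched with ordinary bit masks. The class and method names are illustrative assumptions; in a real VM these bits would live in the object header word rather than a plain `int`.

```java
// Hypothetical sketch of the two-bit header marking: bit 0 says whether
// the object is delinquent, bit 1 distinguishes a root from a child.
public class DelinquentMarks {
    static final int DELINQUENT = 1 << 0; // set: object is delinquent
    static final int ROOT       = 1 << 1; // set: root; clear: child

    static int markRoot(int header) {
        return header | DELINQUENT | ROOT;
    }

    static int markChild(int header) {
        return (header | DELINQUENT) & ~ROOT;
    }

    static boolean isDelinquent(int header) {
        return (header & DELINQUENT) != 0;
    }

    static boolean isRoot(int header) {
        return (header & ROOT) != 0;
    }
}
```

In the one-bit embodiment only `DELINQUENT` would exist and every marked object would be treated as a root.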
In some embodiments, a Java virtual machine can pre-fetch objects based on object type. The object type can be, for example, all objects from the same class. If an object references another object outside of that type or class, the object that is being referenced can result in a cache miss because only the objects of the same type were prefetched. A Java virtual machine can also pre-fetch addresses in memory located after an address that is being fetched. Marking objects as delinquent so that a chain can be formed allows pre-fetching of an entire chain of frequently delinquent loads when the load of the first field in the chain is performed.
Pre-fetching a chain of delinquent objects can begin with a reference (i.e., a load) for an address corresponding to a root object. The root object can be the first loaded field in a chain of loaded fields. In the context of delinquent loads, this root load is the first load in a chain of delinquent loads, i.e., loads for data that are frequently absent from a cache memory. In one embodiment, the root object can be fetched along with all of the child objects via a pre-fetch operation. The pre-fetch operation can pre-fetch the child objects by using the markings in the object header of the object with the field being accessed by the load. The markings of the child objects can be added by the instrumentation. A compiler can define likely chains or trees of delinquent objects. A compiler can be used to identify child loads to create a chain of delinquent objects when the child objects are not marked by the instrumentation. A tree of delinquent objects can include branches from previous objects.
The compiler can use static analysis based on where the delinquent loads are located to determine which objects are the roots and which are the children of a chain. If the children are not marked, a static graph can be used to follow the root to the child. For example, A.Name can then give the next object, B. The chain or tree can be created by the object references instead of dependent delinquent loads because in some embodiments pre-fetching of dependent delinquent loads can pre-fetch loads that were not delinquent because they were pre-fetched with previously loaded objects sharing a cache line or non-dependent delinquent loads can still load dependent objects that share a cache line.
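The static-graph walk from root to child described above can be sketched as follows; the class name `StaticChain` and the representation of the reference graph as a map of object names are hypothetical simplifications of what a compiler's static analysis would produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: follow a compiler-built static reference graph
// (object -> referenced object, e.g. A -> B via A.Name) from a root to
// build the chain of delinquent objects.
public class StaticChain {
    static List<String> chainFrom(String root, Map<String, String> refGraph) {
        List<String> chain = new ArrayList<>();
        String cur = root;
        // Stop at the end of the graph, or if a cycle revisits an object.
        while (cur != null && !chain.contains(cur)) {
            chain.add(cur);
            cur = refGraph.get(cur);
        }
        return chain;
    }
}
```

For a graph in which A references B and B references C, the walk from root A yields the chain A-B-C used in the examples that follow.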
A root load can begin the pre-fetching of a delinquent object chain identified by the root load. The root load and the child loads can be read from memory and stored in the cache memory. Thus when a prefetch operation occurs during execution, because the object from the child load has already been pre-fetched, the virtual machine does not have to read main memory to retrieve the child object.
A chain can start from object A and end with object C. The pre-fetching of object A at an address a can result in multiple cache lines after a being fetched. For ease of illustration, a cache line in this example is 128 bytes, but a cache line can be any number of bytes. An offset of address a can be determined according to the tree size in bytes and the size of a cache line. Data at the original address a can be fetched at block 5, and then multiple cache lines after the original address, for example, at a+128, . . . , a+(floor(treeSize/128)−1)*128 and a+treeSize−1, can be pre-fetched when a is fetched, also at block 5. The floor function removes the fraction from the value calculated from the tree size divided by the size of a cache line, leaving an integer value. The pre-fetching instructions at block 5 pre-fetch the data from address a to the last byte of the tree, represented by (a+treeSize−1). Note the prefetch code of block 5 may be inserted based on instrumentation that identifies root and child objects via markings in accordance with one embodiment, and may be inserted by a compiler in accordance with an embodiment of the present invention.
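The address arithmetic above can be sketched as follows; the class name `PrefetchPlanner` is hypothetical, and a real VM would emit prefetch instructions for these addresses rather than collect them in a list.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: compute the addresses fetched/pre-fetched at
// block 5 for a tree starting at address a, assuming 128-byte lines.
public class PrefetchPlanner {
    static final int CACHE_LINE = 128;

    // Returns a, a+128, ..., a+(floor(treeSize/128)-1)*128, plus the
    // address of the last byte of the tree (a + treeSize - 1).
    static List<Long> prefetchAddresses(long a, long treeSize) {
        List<Long> addrs = new ArrayList<>();
        long fullLines = treeSize / CACHE_LINE; // integer division = floor
        for (long i = 0; i < fullLines; i++) {
            addrs.add(a + i * CACHE_LINE);
        }
        addrs.add(a + treeSize - 1); // last byte of the tree
        return addrs;
    }
}
```

For example, a tree of 300 bytes at address a yields prefetches at a, a+128, and a+299, covering every cache line the tree touches.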
Thus object A can be the root of the chain and a load of A.F (i.e., field F of object A) can result in a cache miss at block 10. However, by pre-fetching address a to the address of the last byte of the chain or tree, a load of B by A.F, and a load of C by B.F, can result in cache hits, at blocks 15 and 20. Between blocks 5, 10, 15 and 20 can be additional program code. Thus using dynamic profile-guided prefetching, the prefetch code of block 5 may be inserted at a point well before the data items are needed in execution of the code. This point may be determined based on hardware performance monitoring data, as discussed above.
At block 10, a read of an address plus an offset represented by A.F can be done and the value read at address A.F can be data that is stored in local variable B. The load at block 15 can store in local variable C the contents of memory located at the address stored in B.F. The load at block 20 can store in local variable I the contents of memory located at the address stored in C.F. In the example, the local variable I is loaded by the code using a chain of objects A-B-C in blocks 10 through 20. Pre-fetching the root object and the child objects of the chains into a cache memory can reduce the time that it takes to load integer I.
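The chain of loads at blocks 10 through 20 can be sketched as follows; the class names `ChainLoad` and `Node` and the field names `f` and `value` are illustrative stand-ins for the objects A, B, and C and their field F.

```java
// Illustrative sketch: the dependent loads of blocks 10-20, where each
// object's field F references the next object in the chain A-B-C.
public class ChainLoad {
    static class Node {
        Node f;    // reference field F
        int value; // terminal integer field loaded into I
    }

    static int loadThroughChain(Node a) {
        Node b = a.f;   // block 10: load A.F into local variable B
        Node c = b.f;   // block 15: load B.F into local variable C
        return c.value; // block 20: load C.F into local variable I
    }
}
```

Each load depends on the previous one, which is why a miss on the root load stalls the whole chain unless the chain has been pre-fetched.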
Loading the physical memory after A can result in objects that are not part of the chain A-B-C being loaded into cache memory and taking space in the cache memory.
Referring now to
For example, object B can be located at an offset of 2,348,320 bytes from object A. Thus, a pre-fetch of object A and the next four cache lines, such as that shown in block 5, would not fetch object B into the cache memory, because object B lies far beyond the pre-fetched region.
Pre-fetching of a reorganized object chain or tree 305 can be done by pre-fetching the root object and the following memory that can be determined by adding together the size of the root object and the child objects of the chain or tree. Pre-fetching a size equal to the size of the objects of the chain or tree added together can allow objects that are members of the chain or tree to be fetched without fetching objects that are not part of the chain or tree.
The Java virtual machine 105 can execute a program within an object 125. The object 125 can load other objects or fields from other objects. The other objects can be identified by an address in memory. The cache memory 120 can be checked first for the address of the field that is being loaded. If the address of the field is not located within the cache memory, the load address can be considered delinquent. The hardware performance monitor 115 can identify this address as a delinquent address. The main memory 140 can then be accessed to load the field of the object 145 identified by address 150 (for example).
The object 145 at address 150 can be marked in the object header as a delinquent root. The hardware performance monitor 115 can identify other objects with fields that are being loaded. The other objects with fields that are going to be loaded in a chain with root object 145 can be identified by marking in the object header as a delinquent child. For example, root object 145 identified by address 150 can be the beginning of a chain of delinquent objects. The chain can include a child object such as child object 155 identified by child address 160.
Identifying chains of delinquent objects can reduce the cache miss rate, in some embodiments. The chain of objects which include the root object 145 and the child object 155 can be pre-fetched into cache memory 120 when a load for a field 135 identified by address 130 is performed. The root object 145 and the child object 155 can be of different types or of different classes, in some embodiments.
Still referring to
A marking of delinquent chains of objects can be helpful in performing a garbage collection operation. Objects can be moved when garbage collection is performed. In a copying scheme, the objects can occupy half of the memory, and when that half is filled the objects are copied to the other half. The live objects can be copied when the copying is performed. The dead objects, i.e., objects to which nothing points, can remain at the previous location and are not moved or copied to the new location. The garbage collection may begin at block 215. Different methods of performing garbage collection may be implemented in different embodiments. In one embodiment a so-called mark-sweep-compact garbage collection may be performed. Such a garbage collection may implement an external, whole-heap compaction. In this way, objects that are live can be marked, these live objects may then be moved to another location in memory, and the remaining portions of memory outside of this portion can be reused. Child objects can be determined at garbage collection time if the child objects are not marked by the instrumentation.
To perform the garbage collection operation, a marking phase in accordance with such a mark-sweep-compact garbage collection routine can implement recursive tracing (block 218). Specifically, when a root delinquent object is encountered, all connected delinquent child objects that have yet to be claimed by other roots may be recursively traced (block 218). Then, delinquent child objects that have not been claimed by other roots can be marked and a hash table entry for each object can be recorded at block 220. The entry can include for a child, its root, and the future offset from that root. An entry for a root can include the root and the total chain size (e.g., in bytes). At the same time, the children and root objects can be marked as ready to prevent other roots from claiming the children. In one embodiment, such ready marking may be indicated by a ready bit being set in the object header of each of the objects.
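The marking-phase bookkeeping above can be sketched as follows. The class name `ChainTable`, the use of strings to identify objects, and the single-map layout are hypothetical simplifications of the hash table described in the text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the marking-phase table: a root's entry holds
// the total chain size in bytes; a child's entry holds its root and its
// future offset from that root. A child already in the table is "ready"
// and cannot be claimed by another root.
public class ChainTable {
    static class Entry {
        final String root;
        final long offsetOrSize; // chain size for a root, offset for a child

        Entry(String root, long offsetOrSize) {
            this.root = root;
            this.offsetOrSize = offsetOrSize;
        }
    }

    final Map<String, Entry> table = new HashMap<>();

    void addRoot(String root, long chainSizeBytes) {
        table.put(root, new Entry(root, chainSizeBytes));
    }

    // Claims a child for a root only if no other root has claimed it.
    boolean claimChild(String child, String root, long futureOffset) {
        if (table.containsKey(child)) {
            return false; // already marked ready by another root
        }
        table.put(child, new Entry(root, futureOffset));
        return true;
    }
}
```

The claim check mirrors the ready bit described above: once a child has been recorded for one root, a second root attempting to claim it fails.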
Next, during a compaction phase, space can be allocated for a chain when a delinquent ready object is encountered at block 225. More specifically, if the encountered object is the first encountered object from its chain, the space may be allocated. Furthermore, the hash table entry for the root may be updated to reflect this change. Objects can then be copied to a new location, referenced via the hash table at block 230. During the garbage collection, delinquent child, root, and ready bits can be unmarked as the objects are copied to their new location. At the conclusion of copying the objects, the hash table can be cleared (still block 230).
Note that in various embodiments, when compaction is performed (i.e., external compaction), objects may be copied in forward order so that the allocation order of objects is maintained. By copying the chained objects in allocation order, later prefetching that is done on the chain objects can provide for the insertion of the correct objects into a cache memory via a minimal amount of prefetching. Note that if instead compaction were performed in which the relative order of objects was reversed, a prefetch such as that shown above may fail to insert the correct chain objects into the cache memory.
Accordingly, at the conclusion of garbage collection, control passes from block 230 to block 235, where a chain of delinquent loads that has had garbage collection performed on it in accordance with the present invention may be prefetched into a cache memory. Note that the operations taking place at blocks 200 through 235 can be repeated for other loads, objects and chains while operations are being performed. For example, other objects can be marked at block 205 while garbage collection is being performed on a chain already identified at block 200 and marked at block 205.
Referring back to
Thus, as described above, the chains of ready objects can be copied to the new location in the order the objects were stored in at the previous location. Copying the chains of objects to a new location in an order different than the order the objects existed in the previous location (e.g., by backwards copying) may cause the chains to be pre-fetched incorrectly.
The loader 400 can read from the cache memory 405 for a field at an address. If the address does not exist in the cache memory 405, the main memory 410 can be read and data at the address can be loaded into the cache memory 405. The monitor 415 can identify addresses that are not present in the cache memory 405. The instrumentation 420 can use this address information from the monitor 415 to identify the objects that contain the field located at the address not found in the cache memory 405. The instrumentation 420 can mark at least one bit in the header of such objects. When the loader 400 loads the field at the address, pre-fetch code inserted by the compiler 425 can pre-fetch the chain of delinquent objects from memory 410 and store them in cache memory 405. The loader 400 can then load the addresses of the chain from the cache memory 405, improving performance.
In various embodiments, one or more of loader 400, monitor 415, instrumentation 420, and compiler 425 may be implemented in software, such as a machine-readable medium including instructions to perform such operations.
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
References throughout this specification to “one embodiment” or “an embodiment” mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation encompassed within the present invention. Thus, appearances of the phrase “one embodiment” or “in an embodiment” are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be instituted in other suitable forms other than the particular embodiment illustrated and all such forms may be encompassed within the claims of the present application.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.