This disclosure relates generally to the processing throughput of data structures in programs, and, more particularly, to methods and apparatus to optimize the processing throughput of data structures in programs.
In various applications a processor is programmed to process (e.g., read, modify and write) data structures (e.g., packets) flowing through the device in which the processor is embedded. For example, in network applications a network processor processes packets (e.g., reads and writes packet header, accesses packet layer-two header to determine packet type and necessary actions, accesses layer-three header to check and update time to live (TTL) and checksum fields, etc.) flowing through a router, a switch, or other network device. In a video server example, a video processor processes streaming video data (e.g., encoding, decoding, re-encoding, verifying, etc.). To achieve high performance (e.g., high packet processing throughput, large number of video channels, etc.), the program executing on the processor must be capable of processing the incoming data structures in a short period of time.
Many processors utilize a multiple level memory architecture, where each level may have a different capacity, access speed, and latency. For example, an Intel® IXP2400 network processor has external memory (e.g., dynamic random access memory (DRAM), etc.) and local memory (e.g., static random access memory (SRAM), scratch pad memory, registers, etc.). The capacity of DRAM is 1 Gigabyte with an access latency of 120 processor clock cycles, whereas the capacity of local memory is only 2560 bytes but with an access latency of 3 processor cycles.
Often, data structures to be processed have to be stored prior to processing. In applications requiring large quantities of data (e.g., network, video, etc.), usually the memory level with the largest capacity (e.g., DRAM) is used as a storage buffer. However, the long latency in accessing data structures stored in a slow memory level (e.g., DRAM) leads to inefficiency in the processing of data structures (i.e., low throughput). It has been recognized that, for high latency memory levels, the number of accesses to a data structure has a more direct impact on the processing throughput of data structures than the size (e.g., number of bytes) of the accesses. For example, for a Level 3 (L3) network switch application running on an Intel® IXP2400 network processor to support an Optical Carrier Level 48 (OC48) packet forwarding rate, the processor cannot have more than three 32-byte DRAM accesses in each thread (assuming one thread per Micro Engine (ME) running in an eight-thread context with a total of eight MEs).
It can be a significant challenge for application developers to carefully, explicitly, and manually (re-)arrange all data structure accesses in their application program code to meet such strict data structure access requirements.
FIG. 7 is a schematic illustration of an example manner of implementing the data structure access optimizer of
FIGS. 9A-C are flowcharts representative of example machine readable instructions which may be executed to implement the data structure access tracer of
FIGS. 10A-B are flowcharts representative of example machine readable instructions which may be executed to implement the data structure access analyzer of
FIGS. 11A-B are flowcharts representative of example machine readable instructions which may be executed to implement the data structure access optimizer of
To reduce the time spent accessing data structures stored in slow memory (i.e., memory with high access latency) during execution of an example program, and thereby increase the processing throughput of data structures, the program is modified to reduce the number of data structure accesses to the slow memory. In one example, this is accomplished by inserting one or more new program instructions to copy a data structure (or a portion of the data structure) from the slow memory to a fast (i.e., low latency) memory, and by modifying existing program instructions to access the copy of the data structure in the fast memory. Further, if the copy of the data structure in the fast memory is anticipated to be modified, added to, or changed by the program, one or more additional program instructions are inserted to copy the modified data structure from the fast memory back to the slow memory. The additional program instructions are inserted at processing end or split points (e.g., an end of a subtask, a call to another execution path, etc.).
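By way of illustration only, the following C++ sketch shows the general shape of such a transformation for a hypothetical packet header; the dram_read/dram_write accessors, offsets, and sizes are illustrative assumptions rather than the instructions actually generated or the API of any particular processor.

```cpp
// Illustrative sketch only; dram_read/dram_write and the offsets are hypothetical.
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t HDR_BYTES  = 32; // assumed aggregate pre-load window
constexpr std::size_t TTL_OFFSET = 8;  // hypothetical field offset

// Stand-ins for slow-memory accessors; on a real network processor each call
// would incur the long DRAM access latency discussed above.
void dram_read(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) { std::memcpy(dst, src, n); }
void dram_write(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) { std::memcpy(dst, src, n); }

// Before: each field access is a separate slow-memory access.
void process_before(std::uint8_t* pkt_dram) {
    std::uint8_t ttl;
    dram_read(pkt_dram + TTL_OFFSET, &ttl, 1);   // slow read
    ttl -= 1;
    dram_write(pkt_dram + TTL_OFFSET, &ttl, 1);  // slow write
    // ...a checksum update would add further slow accesses...
}

// After: one inserted pre-load, accesses rewritten against the local copy
// (relative to the start of the pre-loaded portion), one inserted write-back.
void process_after(std::uint8_t* pkt_dram) {
    std::uint8_t local[HDR_BYTES];             // fast (local) memory copy
    dram_read(pkt_dram, local, HDR_BYTES);     // inserted pre-load
    local[TTL_OFFSET] -= 1;                    // modified accesses use fast memory only
    // ...the checksum update also operates on 'local'...
    dram_write(pkt_dram, local, HDR_BYTES);    // inserted write-back at the end point
}
```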
It should be readily apparent to persons of ordinary skill in the art that the portions of the program to be optimized can be selected using any of a variety of well known techniques. For example, the portions of the program may represent: (1) program instructions that are critical (e.g., identified by a profiler, or known a priori to determine the processing throughput of data structures), (2) program instructions that are assigned to particular computational resources or units (e.g., to an ME of an Intel® IXP2400 network processor), and/or (3) program instructions that are considered to be cold (i.e., seldom executed). Further, the portions of the program to be optimized may be identified in any of a variety of well known ways (e.g., by the programmer, during compilation, etc.). Thus, in discussions throughout this document, “optimization of the program” is used, without restriction, to mean optimization of the entire program, optimization of multiple portions of the program, or optimization of a single portion of the program.
To identify and characterize anticipated data structure accesses in the program, the DSAT 210 of
To characterize the anticipated data structure accesses in each execution path, the DSAA 215 of
To optimize the data structure accesses, the DSAO 220 uses the aggregate data structure access information determined by the DSAA 215 to determine where and what program instructions to insert to pre-load all or a portion of a data structure, and to determine which program instructions to modify, and how to modify them, so that they operate on the pre-loaded data structure or portion thereof. If the program is expected to modify the pre-loaded data structure, the DSAO 220 inserts additional program instructions to write back the modified portion of the data structure. The modified data structure may be written back to the original storage memory or to another memory.
As will be readily appreciated by persons of ordinary skill in the art, the example DSTO 200 of
In a second example, the DSAT 210 of
In a third example, the application is programmed for a multi-processor device that partitions the program into subtasks and assigns subtasks to different processing elements. For example, non-critical subtasks could be assigned to slower processing elements. The application may also be pipelined to exploit parallelism, with one stage on each processing element. Because a copy of a data structure in local (i.e., fast) memory cannot be shared across processing elements, pre-load and write-back program instructions are inserted at each processing entry (i.e., start of a subtask) and end (i.e., end of a subtask) point. In particular, the DSAT 210 of
The data structure access recorder 310 records and stores in the memory 225 information representative of the flow of anticipated data structure accesses for each execution path from the entry function to each execution path end point or data send point (i.e., a point where a data structure is sent to another subtask or execution path).
Each entry in the table 400 of
As the data structure access tracer 605 traces through the data access graph, the data structure access tracer 605 provides information to the data structure access annotator 610 and the data structure access aggregator 615. For example, at a data structure read node, the data structure access tracer 605 instructs the data structure access annotator 610 to annotate the corresponding node in the IR tree. The annotations contain information required by the DSAO 220 to perform program instruction modifications (e.g., to translate a data structure read from the storage memory to the local memory, and to translate the read relative to the beginning of the portion of the data structure that is pre-loaded rather than from the beginning of the data structure). In another example, at a call to another subtask the data structure access tracer 605 instructs the data structure access annotator 610 to insert and annotate a new node in the IR tree corresponding to a data structure write-back. It should be readily apparent to persons of ordinary skill in the art that other methods of determining and/or marking program instructions for modification or insertion could be used. For example, the data structure access annotator 610 can insert temporary “marking” codes into the program containing information indicative of changes to be made. The DSAO 220 could then locate the “marking” codes and make corresponding program instruction modifications or insertions.
At each data structure access (read or write) node, the data structure access tracer 605 passes information on the access to the data structure access aggregator 615. The data structure access aggregator 615 accumulates data structure access information for the execution path. For example, the data structure access aggregator 615 determines the required offset and size of a data structure pre-load, and the required offset and size of a data structure write-back. The information accumulated by the data structure access aggregator 615 is used by the DSAO 220 to generate inserted program instructions to realize data structure pre-loads and write-backs.
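For example, and without limitation, the aggregation performed by the data structure access aggregator 615 might resemble the following sketch; the AccessWindow type and its fields are assumptions, not the actual structures used.

```cpp
// Illustrative sketch of aggregating anticipated accesses into one window.
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct AccessWindow {
    std::size_t offset = SIZE_MAX;  // start of the aggregate window (bytes into the data structure)
    std::size_t end    = 0;         // one past the last byte touched

    // Fold one anticipated access into the window.
    void add(std::size_t access_offset, std::size_t access_size) {
        offset = std::min(offset, access_offset);
        end    = std::max(end, access_offset + access_size);
    }
    std::size_t size() const { return end > offset ? end - offset : 0; }
};

// One window accumulated over the anticipated reads yields the pre-load offset
// and size for the execution path; a second window accumulated over the
// anticipated writes yields the write-back offset and size.
```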
FIGS. 8, 9A-C, 10A-B, and 11A-B illustrate flowcharts representative of example machine readable instructions that may be executed by an example processor 1210 of
The example machine readable instructions of FIGS. 8, 9A-C, 10A-B, and 11A-B may be implemented using any of a variety of well-known techniques, for example, using object oriented programming techniques and using structures to store program variables, the IR tree, and the data access graph. In particular, the access entry 500 could be implemented using a “struct”, and the data access graph (i.e., the table 400) and the data structure access recorder 315 could be implemented using an object oriented “class” containing public functions to add nodes to the graph (e.g., inserting a data structure access node, inserting a data structure write node, inserting a program call node, inserting an end node, inserting an if node, etc.).
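By way of illustration, such an implementation might be sketched as follows; the node kinds, field names, and member functions shown are assumptions rather than the actual contents of the access entry 500 or the table 400.

```cpp
// Illustrative sketch only; field and function names are assumptions.
#include <cstddef>
#include <string>
#include <vector>

enum class NodeKind { Access, Write, Call, Send, If, End };

struct AccessEntry {                  // a "struct" standing in for access entry 500
    NodeKind    kind;
    std::string data_structure;       // which data structure is accessed
    std::size_t offset = 0;           // byte offset of the access, if static
    std::size_t size   = 0;           // byte size of the access
    bool        is_static  = true;    // offset/size known at compile time?
    int         ir_node_id = -1;      // back-reference to the corresponding IR node
};

class DataAccessGraph {               // a "class" standing in for the table 400
public:
    // Public functions to add nodes to the graph, as described above.
    void insert_access(const AccessEntry& e) { entries_.push_back(e); }
    void insert_write(const AccessEntry& e)  { entries_.push_back(e); }
    void insert_call(int ir_node_id) { entries_.push_back({NodeKind::Call, "", 0, 0, true, ir_node_id}); }
    void insert_if(int ir_node_id)   { entries_.push_back({NodeKind::If, "", 0, 0, true, ir_node_id}); }
    void insert_end()                { entries_.push_back({NodeKind::End, "", 0, 0, true, -1}); }
    const std::vector<AccessEntry>& entries() const { return entries_; }
private:
    std::vector<AccessEntry> entries_;  // flattened record for one execution path
};
```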
It should be readily apparent to persons of ordinary skill in the art, that the example machine readable instructions of FIGS. 8, 9A-C, 10A-B, and 11A-B can be applied to programs in a variety of ways. In the earlier example of the OC48 L3 switch application executing on an Intel® IXP2400 network processor, there are a variety of choices in how to optimize the program. In a preferred example, only critical execution paths assigned to MEs are optimized, and packet pre-loads and write-backs are inserted at the entry, exit, call, and data send points of each critical execution path. In another example, optimization is performed globally, is applied to all execution paths, packet pre-loads are included at the entry point of a receive module (that receives packets from a network card), and packet write-backs are included at the end point of a transmit module (that provides packets to a network card). In a further example, optimization is performed on a processing element (e.g., ME) basis, and packet pre-loads and write-backs are inserted at the entry and exit points for a processing unit.
The example machine readable instructions of
The example machine readable instructions of FIGS. 9A-C trace the anticipated data structure accesses to create the data access graph. As illustrated in FIGS. 9A-C, the example machine readable instructions of FIGS. 9A-C are performed recursively. The example machine readable instructions of FIGS. 9A-C process each node of the portion of the IR tree for an execution path (typically signified by an entry node in the IR tree) (block 904). The DSAT 210 determines if the node is a data structure access node (block 906). If the node is a data structure access node, the DSAT 210 determines if the access is static (block 908). If the data structure access is static, the DSAT 210 creates a data structure access node in the data flow graph (block 910). Control then proceeds to block 940 of
Returning, for purposes of discussion, to block 906, if the node is not a data structure access node, the DSAT 210 determines if the node is a call node (block 918). If the node is a call node, the DSAT 210 creates a call node in the data flow graph (block 920) and traces the data structure accesses of the called program (block 921) by recursively using the example machine readable instructions of FIGS. 9A-C. After the recursive execution returns (block 921), control proceeds to block 940 (
Returning, for purposes of discussion, to block 918, if the node is not a call node, the DSAT 210 determines if the node is a data send (i.e., a transfer of a data structure to another execution path) node (
Returning, for purposes of discussion, to block 922, if the node is not a data send node, the DSAT 210 determines if the node is an if (i.e., conditional) node (block 930). If the node is an if node (block 930), the DSAT 210 traces the data structure accesses of the if path (block 931) by recursively using the example machine readable instructions of FIGS. 9A-C. After the recursive execution returns (block 931), the DSAT 210 then creates an if node in the data flow graph (block 932), and traces the data structure accesses of the then path (block 933) by recursively using the example machine readable instructions of FIGS. 9A-C. After the recursive execution returns (block 933), the DSAT 210 next traces the data structure accesses of the else path (block 934) by recursively using the example machine readable instructions of FIGS. 9A-C. After the recursive execution returns (block 934), the DSAT 210 then joins the two paths in the data flow graph (block 935) and control proceeds to block 940 of
Returning, for purposes of discussion, to block 930, if the node is not an if node, the DSAT 210 determines if the node is a return, end of execution path, or data structure drop (e.g., abort, ignore modifications, etc.) node (block 936 of
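By way of illustration only, the recursion of FIGS. 9A-C might be sketched as follows; the IRNode type, its fields, and the Graph recording helpers are simplified assumptions (the static/dynamic determination of block 908 is collapsed into a single flag).

```cpp
// Illustrative sketch only; the IR node layout and Graph helpers are assumptions.
#include <vector>

enum class IRKind { Access, Call, Send, If, Return, Other };

struct IRNode {
    IRKind kind = IRKind::Other;
    bool   is_static_access = true;       // offset/size known at compile time?
    std::vector<IRNode*> children;        // body of the execution path
    IRNode* callee    = nullptr;          // for Call nodes
    IRNode* then_path = nullptr;          // for If nodes
    IRNode* else_path = nullptr;
};

struct Graph {                            // minimal stand-in for the data access graph
    void add_access(const IRNode&, bool /*is_static*/) { /* record access node */ }
    void add_call()   { /* record call node */ }
    void add_send()   { /* record data send node */ }
    void add_if()     { /* record if node */ }
    void add_end()    { /* record end node */ }
    void join_paths() { /* re-join then/else paths */ }
};

// Recursively trace the anticipated data structure accesses of one execution path.
void trace(const IRNode& path, Graph& g) {
    for (const IRNode* n : path.children) {
        switch (n->kind) {
        case IRKind::Access:                           // blocks 906-910
            g.add_access(*n, n->is_static_access);
            break;
        case IRKind::Call:                             // blocks 918-921
            g.add_call();
            if (n->callee) trace(*n->callee, g);       // recurse into the called program
            break;
        case IRKind::Send:                             // block 922 onward
            g.add_send();
            break;
        case IRKind::If:                               // blocks 930-935
            g.add_if();
            if (n->then_path) trace(*n->then_path, g); // trace the then path
            if (n->else_path) trace(*n->else_path, g); // trace the else path
            g.join_paths();                            // join the two paths
            break;
        case IRKind::Return:                           // block 936
            g.add_end();
            break;
        default:                                       // other nodes: nothing recorded
            break;
        }
    }
}
```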
The example machine readable instructions of FIGS. 10A-B analyze the data access graph and annotate the IR tree. As illustrated in FIGS. 10A-B, the example machine readable instructions of FIGS. 10A-B are performed recursively. The example machine readable instructions of FIGS. 10A-B process each node of a portion of the data flow graph for an execution path (block 1002). The DSAA 215 determines if the node is a data structure access node (block 1004). If the node is an access node (block 1004), then the DSAA 215 updates the information representative of the aggregate accesses of the data structure (block 1006), and annotates the corresponding IR node (block 1008). Control then proceeds to block 1024 of
Returning, for purposes of discussion, to block 1004, if the node is not a data structure access node, the DSAA 215 determines if the node is a call or data send node (block 1010). If the node is a call or data send node (block 1010), the DSAA 215 adds a write-back node to the IR tree (block 1012) and annotates the new write-back node (block 1016). Control then proceeds to block 1024 of
Returning, for purposes of discussion, to block 1010, if the node is not a call or data send node, the DSAA 215 determines if the node is an if node (block 1017). If the node is an if node (block 1017), the DSAA 215 recursively analyzes the portion of the data access graph for the then path and annotates the IR tree using the example machine readable instructions of FIGS. 10A-B (block 1018). After the recursive execution returns (block 1018), the DSAA 215 recursively analyzes the portion of the data access graph for the else path and annotates the IR tree using the example machine readable instructions of FIGS. 10A-B (block 1019). After the recursive execution returns (block 1019), the DSAA 215 merges (i.e., combines) the information representative of the aggregate accesses of the data structure for the then and else paths (block 1020). Control then proceeds to block 1024 of
Returning, for purposes of discussion, to block 1017, if the node is not an if node, the DSAA 215 recursively analyzes the portion of the data access graph for the other path (i.e., the portion of the data access graph starting with the node) and annotates the IR tree using the example machine readable instructions of FIGS. 10A-B (block 1022). After the recursive execution returns (block 1022), control proceeds to block 1024.
After all data flow graph nodes for the execution path have been processed (block 1024), the DSAA 215 processes all nodes in the IR tree (block 1026). The DSAA 215 determines if the node is an execution path entry node (block 1028). If the node is an entry node (block 1028), the DSAA 215 adds a data structure pre-load node to the IR tree (block 1030) and annotates the added pre-load node with the information representative of the aggregate read data structure data accesses (block 1032) and control proceeds to block 1034. At block 1034, the DSAA 215 determines if all IR tree nodes have been processed. If so, the DSAA 215 ends the example machine readable instructions of FIGS. 10A-B. Otherwise, control returns to block 1002 of
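By way of illustration only, the analysis of FIGS. 10A-B might be sketched as follows; the graph node layout, the Aggregate type, and the annotation hooks are assumptions rather than the actual implementation, and the second pass that inserts pre-load nodes (blocks 1026-1032) is summarized in a comment.

```cpp
// Illustrative sketch only; node layout, Aggregate, and annotation hooks are assumptions.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class GKind { Access, Call, Send, If, End, Other };

struct GNode {                                     // one data access graph node
    GKind       kind = GKind::Other;
    std::size_t offset = 0, size = 0;              // for Access nodes
    int         ir_node_id = -1;
    std::vector<const GNode*> then_path, else_path, other_path;
};

struct Aggregate {                                 // aggregate accesses for the path
    std::size_t lo = SIZE_MAX, hi = 0;
    void add(std::size_t o, std::size_t s) { lo = std::min(lo, o); hi = std::max(hi, o + s); }
    void merge(const Aggregate& b) { lo = std::min(lo, b.lo); hi = std::max(hi, b.hi); }
};

// Assumed hooks into the IR tree, standing in for the annotator of the DSAA.
void annotate_ir_access(int /*ir_node_id*/, const Aggregate&) {}
void add_writeback_node(int /*ir_node_id*/, const Aggregate&) {}

void analyze(const std::vector<const GNode*>& path, Aggregate& agg) {
    for (const GNode* n : path) {
        switch (n->kind) {
        case GKind::Access:                        // blocks 1004-1008
            agg.add(n->offset, n->size);
            annotate_ir_access(n->ir_node_id, agg);
            break;
        case GKind::Call:
        case GKind::Send:                          // blocks 1010-1016
            add_writeback_node(n->ir_node_id, agg);
            break;
        case GKind::If: {                          // blocks 1017-1020
            Aggregate then_agg = agg, else_agg = agg;
            analyze(n->then_path, then_agg);
            analyze(n->else_path, else_agg);
            agg = then_agg;
            agg.merge(else_agg);                   // merge the then and else paths
            break;
        }
        case GKind::End:
            break;
        default:                                   // block 1022: analyze the other path
            analyze(n->other_path, agg);
            break;
        }
    }
    // A second pass over the IR tree (blocks 1026-1032) would then add a pre-load
    // node at each execution path entry, annotated with the aggregate reads.
}
```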
It will be readily apparent to persons of ordinary skill in the art that the example machine readable instructions of FIGS. 9A-C and 10A-B could be combined and/or executed simultaneously. For example, the DSTO 200 could annotate the IR tree while tracing the anticipated data structure accesses in the program. In particular, the recorded representative information could be retained only long enough to be analyzed and corresponding IR tree annotations created. In this fashion, the recorded representative information is not necessarily stored (i.e., retained) in a table, data structure, etc.
The example machine readable instructions of FIGS. 11A-B modify the program based on the annotated IR tree to optimize the processing throughput of data structures. The example machine readable instructions of FIGS. 11A-B process each node of the annotated IR tree (block 1102). The DSAO 220 determines if the node is a data structure pre-load node (block 1104). If the node is a data structure pre-load node (block 1104), the DSAO 220 reads the annotation information from the pre-load node (block 1106) and inserts into the program pre-load program instructions corresponding to the annotation information (block 1108). Control proceeds to block 1132 of
Returning, for purposes of discussion, to block 1104, if the node is not a pre-load node, the DSAO 220 determines if the node is a data structure write-back node (block 1110). If the node is a write-back node (block 1110), the DSAO 220 reads the annotation information for the node (block 1112) and determines if modifications to the data structure are dynamic or static (block 1114). If the modifications are dynamic (block 1114), the DSAO 220 inserts program instructions to create a run-time variable that tracks what portion(s) of the data structure have been modified (block 1116), and control then proceeds to block 1118. Returning, for purposes of discussion, to block 1114, if the modifications are not dynamic, the DSAO 220 inserts program instructions to perform the data structure write-back (block 1118), and control then proceeds to block 1132 of
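By way of illustration only, the run-time tracking inserted for dynamic modifications might resemble the following sketch; the DirtyRange variable, the dram_write accessor, and the calling convention are assumptions.

```cpp
// Illustrative sketch only; DirtyRange and dram_write are hypothetical names.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

void dram_write(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) { std::memcpy(dst, src, n); }

struct DirtyRange {                                    // inserted run-time variable (block 1116)
    std::size_t lo = SIZE_MAX, hi = 0;
    void mark(std::size_t offset, std::size_t size) {  // executed at each dynamic write
        lo = std::min(lo, offset);
        hi = std::max(hi, offset + size);
    }
};

// Inserted write-back (block 1118): only the portion recorded as modified at run
// time is copied from the fast local copy back to the slow storage memory.
void write_back(std::uint8_t* pkt_dram, const std::uint8_t* local, const DirtyRange& d) {
    if (d.hi > d.lo)
        dram_write(pkt_dram + d.lo, local + d.lo, d.hi - d.lo);
}
```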
Returning, for purposes of discussion, to block 1110, if the node is not a write-back node, the DSAO 220 determines if the node is a data structure access node (block 1120 of
Returning, for purposes of discussion, to block 1124, if the access is dynamic, the DSAO 220 inserts and modifies program code to verify that accesses of the data structure target the correct memory level (e.g., the local memory for the pre-loaded portion), and to access the data structure from the correct memory level (block 1130). Control then proceeds to block 1132.
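By way of illustration only, the check inserted for such a dynamic access might resemble the following sketch; the pre-load window parameters and the dram_read accessor are assumptions.

```cpp
// Illustrative sketch only; parameter names and dram_read are hypothetical.
#include <cstddef>
#include <cstdint>
#include <cstring>

void dram_read(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) { std::memcpy(dst, src, n); }

// Read 'size' bytes at a run-time-computed 'offset' into the data structure:
// use the fast local copy when the access falls inside the pre-loaded window,
// otherwise fall back to the slow storage memory.
void dynamic_read(const std::uint8_t* pkt_dram, const std::uint8_t* local,
                  std::size_t preload_offset, std::size_t preload_size,
                  std::size_t offset, std::size_t size, std::uint8_t* out) {
    if (offset >= preload_offset && offset + size <= preload_offset + preload_size)
        std::memcpy(out, local + (offset - preload_offset), size);  // fast (local) memory
    else
        dram_read(pkt_dram + offset, out, size);                    // slow (storage) memory
}
```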
Returning, for purposes of discussion, to block 1120, if the node is not an access node, control proceeds to block 1132. The DSAO 220 determines if all nodes have been processed (block 1132). If all nodes of the IR tree have been processed (block 1132), the DSAO 220 ends the example machine readable instructions of FIGS. 11A-B. Otherwise, control returns to block 1102 of
The processor platform 1200 of the example includes the processor 1210, which is a general purpose programmable processor. The processor 1210 executes coded instructions 1227 present in a memory of the processor platform 1200 (e.g., the RAM 1225). The processor 1210 may be any type of processing unit, such as a microprocessor from the Intel® Centrino® family of microprocessors, the Intel® Pentium® family of microprocessors, the Intel® Itanium® family of microprocessors, and/or the Intel XScale® family of processors. The processor 1210 includes a local memory 1212. The processor 1210 may execute, among other things, the example machine readable instructions illustrated in FIGS. 8, 9A-C, 10A-B, and 11A-B.
The processor 1210 is in communication with the main memory, including a read only memory (ROM) 1220 and/or a RAM 1225, via a bus 1205. The RAM 1225 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), and/or any other type of RAM device. The ROM 1220 may be implemented by flash memory and/or any other desired type of memory device. Access to the memory space 1220, 1225 is typically controlled by a memory controller (not shown) in a conventional manner. The RAM 1225 may be used by the processor 1210 to implement the memory 225, and/or to store the coded instructions 1227 that can be executed to implement the example machine readable instructions illustrated in FIGS. 8, 9A-C, 10A-B, and 11A-B.
The processor platform 1200 also includes a conventional interface circuit 1230. The interface circuit 1230 may be implemented by any type of well known interface standard, such as an external memory interface, serial port, general purpose input/output, etc. One or more input devices 1235 are connected to the interface circuit 1230. One or more output devices 1240 are also connected to the interface circuit 1230.
Of course, one of ordinary skill in the art will recognize that the order, size, and proportions of the memory illustrated in the example systems may vary. For example, the user/hardware variable space may be larger than the main firmware instructions space. Additionally, although this patent discloses example systems including, among other components, software or firmware executed on hardware, it should be noted that such systems are merely illustrative and should not be considered as limiting. For example, it is contemplated that any or all of these hardware and software components could be embodied exclusively in hardware, exclusively in software, exclusively in firmware, or in some combination of hardware, firmware, and/or software. Accordingly, while the above describes example systems, persons of ordinary skill in the art will readily appreciate that the examples provided are not the only way to implement such systems.
Although certain example methods, apparatus and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the appended claims either literally or under the doctrine of equivalents.
This patent arises from a continuation of International Patent application No. PCT/US05/21702, entitled “Methods and Apparatus to Optimize Processing Throughput of Data Structures in Programs” which was filed on Jun. 05, 2005. International Patent application No. PCT/US05/21702 is hereby incorporated by reference in its entirety.
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | PCT/US05/21702 | Jun 2005 | US |
| Child | 11549745 | Oct 2006 | US |