Portable computing devices (e.g., cellular telephones, smart phones, tablet computers, portable digital assistants (PDAs), and portable game consoles) continue to offer an ever-expanding array of features and services, and provide users with unprecedented levels of access to information, resources, and communications. To keep pace with these service enhancements, such devices have become more powerful and more complex. Portable computing devices now commonly include a system on chip (SoC) comprising one or more chip components embedded on a single substrate (e.g., a plurality of central processing units (CPUs), graphics processing units (GPUs), digital signal processors, etc.).
It is desirable for such multi-processor devices or other computing systems (e.g., desktop computers, data server nodes, etc.) to be able to profitably parallelize application code running on the device based on code cost analysis. Existing code cost analysis techniques and solutions for parallelizing application code, however, rely on simple cost heuristics, which may not be able to analyze complex control flow or provide adequate runtime profitability checks.
Accordingly, there is a need in the art for improved systems, methods, and computer programs for providing parallelization of application code at runtime.
Various embodiments of methods, systems, and computer programs are disclosed for performing runtime auto-parallelization of application code. One embodiment of such a method comprises receiving application code to be executed in a multi-processor system. The application code comprises an injected code cost computation expression for at least one loop in the application code defining a serial workload for processing the loop. A runtime profitability check of the loop is performed based on the injected code cost computation expression to determine whether the serial workload can be profitably parallelized. If the serial workload can be profitably parallelized, the loop is executed in parallel using two or more processors in the multi-processor system.
Another embodiment is a system for performing runtime auto-parallelization of application code. The system comprises a plurality of processors and a runtime environment configured to execute application code via one or more of the plurality of processors. The runtime environment comprises an auto-parallelization controller configured to receive the application code to be executed via one or more of the processors. The application code comprises an injected code cost computation expression for at least one loop in the application code defining a serial workload for processing the loop. The auto-parallelization controller performs a runtime profitability check of the loop based on the injected code cost computation expression to determine whether the serial workload can be profitably parallelized. If the serial workload can be profitably parallelized, the auto-parallelization controller executes the loop in parallel using two or more processors.
In the Figures, like reference numerals refer to like parts throughout the various views unless otherwise indicated. For reference numerals with letter character designations such as “102A” or “102B”, the letter character designations may differentiate two like parts or elements present in the same Figure. Letter character designations for reference numerals may be omitted when it is intended that a reference numeral encompass all parts having the same reference numeral in all Figures.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
In this description, the term “application” or “image” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, an “application” referred to herein may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
The term “content” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, “content” referred to herein may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
As used in this description, the terms “component,” “database,” “module,” “system,” and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal).
The computing device 102 may comprise one or more processors 110 coupled to a memory 112. The memory 112 may comprise an integrated development environment (IDE) 118. The IDE 118 comprises one or more software applications that provide comprehensive facilities to computer programmers for software development. The IDE 118 may include, for example, a source code editor, various build automation tools, a debugger, and a compiler 120. The compiler 120 may further comprise code cost analysis (CCA) and optimization module(s) 122. The CCA module(s) 122 may execute as part of the compiler's optimization engine. As known in the art, the compiler 120 compiles the application source code 302 into the compiled application code 124.
The CCA module(s) 122 comprise the logic and/or functionality for implementing various CCA algorithms configured to process the application source code 302, identify code loops, and compute the code costs associated with the code loops. As described below in more detail, the CCA algorithms may be configured to perform partial or static code cost computations and generate code cost computation expressions 144. The code cost computation expressions 144 are injected into the compiled application code 124 and may be used, at runtime, to determine whether a loop may be profitably parallelized. In this regard, the application code 124 may be compiled with a serial code version 142 and a parallelized code version 143 for code loops. At runtime, the serial code version 142 may be used when a code loop is to be executed using a single processor 126. If the code loop may be profitably parallelized, the parallelized code version 143 may be used to execute the loop in parallel using two or more processors 126.
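For illustration only, the following minimal C++ sketch suggests one conceptual shape such compiled output could take. The names run_serial_142, run_parallel_143, and cost_expression_144, as well as the per-iteration cost and overhead constants, are hypothetical stand-ins for the two code versions and the injected expression, not actual compiler output.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical serial code version 142: processes all elements on one processor.
static void run_serial_142(std::vector<int>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) data[i] *= 2;
}

// Hypothetical parallelized code version 143: splits the iterations across n threads.
static void run_parallel_143(std::vector<int>& data, unsigned n) {
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + n - 1) / n;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&data, t, chunk] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i) data[i] *= 2;
        });
    }
    for (auto& w : workers) w.join();
}

// Hypothetical injected code cost computation expression 144: the serial
// workload W as a function of a trip count known only at runtime.
static long cost_expression_144(std::size_t trip_count) {
    const long cost_per_iteration = 4;  // assumed statically derived cost
    return static_cast<long>(trip_count) * cost_per_iteration;
}

void process(std::vector<int>& data) {
    unsigned n = std::thread::hardware_concurrency();
    const long overhead = 10000;  // assumed parallelization overhead
    long w = cost_expression_144(data.size());
    // Runtime profitability check: parallelize only if W/N + O < W.
    if (n > 1 && w / n + overhead < w) run_parallel_143(data, n);
    else run_serial_142(data);
}
```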
One of ordinary skill in the art will appreciate that the term “profitable” in the context of application code refers to a more desirable final implementation of application code than an original existing implementation. For example, “profitable” may refer to a final implementation of an application code that runs in less time than the original, consumes less memory than the original, or consumes less power than the original, although there may be other embodiments of profitability based on other desirable goals.
The term “profitably parallelized” refers to a piece of sequentially executed code that may be parallelized or executed in parallel and is expected to demonstrate some measure of profitability as a result.
It should be appreciated that the term “runtime auto-parallelization” may be independent of a specific point in time when auto-parallelization may occur. For example, auto-parallelization may occur at compile time or at runtime. In this description, the term “runtime auto-parallelization” refers to the decision, at runtime, of executing application code either in its original sequential form or in a parallel form. The decision may be, for instance, to always or never execute the parallel form of the application. In other instances, the decision may be made based on information available only at runtime.
At block 210, the runtime environment 141 receives the compiled application code 124 comprising the code cost computation expression(s) 144 and the serial code version 142 and the parallelized code version 143 for code loops. At block 212, the auto-parallelization controller 138 may perform a runtime profitability check 140 based on the code cost computation expressions 144 injected in the application code 124 by the compiler 120. At decision block 214, the auto-parallelization controller 138 may determine for each code loop whether parallelization will be profitable. If “yes”, at block 216, the auto-parallelization controller 138 may initiate parallel execution of a code loop via two or more processors 126 using, for example, the parallelized code version 143. If “no”, at block 218, the auto-parallelization controller 138 may initiate serial execution of a code loop via a single processor 126 using, for example, the serial code version 142.
In this regard, it should be appreciated that the CCA module(s) 122 and the auto-parallelization controller 138 may support various code cost use cases depending on the nature of the application code, the runtime environment 141, etc. For example, the CCA algorithms may determine that a first type of loop (Loop 1) cannot be parallelized, in which case the runtime environment 141 may always execute Loop 1 using a single processor 126. For a second type of loop (Loop 2), the CCA algorithms may determine that the loop may always be profitably parallelized because, for example, all loop trip counts may be statically resolved. In this use case, the runtime environment 141 may always execute Loop 2 in parallel using two or more processors 126. As described below in more detail, a third use case involves a loop (Loop 3) for which the CCA algorithms cannot statically resolve all loop trip counts. In this scenario, the CCA algorithms compute a code cost computation expression 144 for Loop 3, which is injected into the application code 124 and used by the runtime environment 141 to perform the runtime profitability check 140 and determine whether Loop 3 may be profitably parallelized. If, based on the runtime profitability check 140 and the number of available processors 126, it is determined that parallelization would be profitable, Loop 3 may be executed in parallel using the available processors 126. If, however, parallelization would not be profitable, Loop 3 may be executed using a single processor 126.
In other words, it should be appreciated that the runtime profitability check 140 determines whether the loop comprises enough work (e.g., instruction cycles, execution time, etc.) such that it may be profitably parallelized. In an embodiment, the runtime profitability check 140 may implement Equation 1 below.
(W/N + O) < W    Equation 1

wherein W represents the amount of work in the loop (i.e., the serial workload), N represents the number of processors 126 used for parallel execution, and O represents the parallelization overhead.
If (W/N + O) < W, it is determined that the loop may be profitably parallelized, and the parallelized code version 143 may be used. If (W/N + O) is greater than or equal to W, it is determined that the loop may not be profitably parallelized, and the serial code version 142 may be used.
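As a purely hypothetical illustration, suppose W = 100,000 instruction cycles, N = 4 processors, and O = 10,000 cycles of parallelization overhead: W/N + O = 35,000 < 100,000, so parallel execution would be profitable. With W = 8,000 cycles and the same N and O, W/N + O = 12,000 ≥ 8,000, so serial execution would be preferred.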
As mentioned above, in certain situations, the amount of work in the loop (W) may be completely determined at compile time. However, if the amount of work in the loop (W) cannot be completely determined at compile time, the CCA algorithms 122 generate the code cost computation expression 144 and inject it into the application code. For example, consider the situation in which the application code 124 comprises a loop for processing a picture/photo to be selected by the user 108. The execution cost (e.g., the number of instructions executed) of the loop may depend on the size of the image selected (e.g., width, height, resolution). The CCA algorithms 122 may generate a code cost computation expression 144 comprising a numerical expression. The numerical expression may be represented according to Equation 2 below.
W = S + R    Equation 2

wherein W represents the total work in the loop, S represents the portion of the code cost that is computed statically at compile time, and R represents the portion of the code cost that is computed at runtime.
It should be appreciated that the relationship between S and R may vary depending on, for example, loop trip counts, loop execution counts, inter-loop dependences, etc., and, therefore, may be represented according to any mathematical formula.
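A minimal sketch of what an injected expression 144 for the picture-processing example above might look like is shown below; the function name image_loop_cost and the constant values for S and the per-pixel cost are illustrative assumptions, not values produced by the CCA module(s) 122.

```cpp
// Hypothetical injected code cost computation expression 144 implementing
// W = S + R (Equation 2): S is a statically computed cost portion, and R
// depends on the width and height of the image selected at runtime.
long image_loop_cost(long width, long height) {
    const long s = 250;             // assumed statically computed portion S
    const long cost_per_pixel = 6;  // assumed per-iteration cost from compile-time analysis
    long r = width * height * cost_per_pixel;  // runtime-dependent portion R
    return s + r;                   // W = S + R
}
```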
In this regard, it should be appreciated that an external profiling process may be implemented for collecting information related to the behavior of the program or application code (referred to as “profiling information”). Profiling information may comprise, for example, total loop trip counts, average loop trip counts, total number of times a branch is taken, probability of a branch being taken, number of times a function is invoked, and equivalent forms from which such data may be determined. Profiling information may also include other types of information, such as, for example, power consumption information during execution, memory bandwidth requirements, memory access patterns, and hardware counter events. The profiling process may be performed in various ways. In one exemplary implementation, the profiling process may be performed by application code instrumentation made by compiler transformations or by external tools, such as execution tracers, hypervisors, and/or virtual machines.
In an embodiment, the CCA module(s) 122 construct a directed acyclic graph (DAG) 401 representing the control flow of the application code 124, in which each cost unit node corresponds to a loop, a conditional construct, or a basic block.
It should be appreciated that the CCA module(s) 122 are configured to statically compute as much of the code cost as possible at compile time based on the DAG 401 (referred to as static or partial code cost computations). In an embodiment, the CCA module(s) 122 compute the cost of each cost unit node in the DAG 401 in a bottom-up manner. The cost of children nodes is aggregated at the parent node level based on the type of node (i.e., loop, conditional, or basic block). The cost of a basic block may be determined based on the category of its instructions (e.g., computation instructions, write memory access instructions, read memory access instructions, etc.). The cost of an if-else construct may be computed as the minimum cost of the “taken” and the “not taken” paths or, when profiling information is available, using a statistical method that takes the profiling information as input. The cost of a loop may be computed as the summation of its children's costs multiplied by the loop trip count.
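The following C++ sketch illustrates one possible form of such a bottom-up traversal; the node representation, per-node cost fields, and the use of a simple minimum for conditionals (rather than a profiling-weighted statistical method) are assumptions for illustration.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical cost-unit node kinds in the DAG 401.
enum class Kind { BasicBlock, Conditional, Loop };

struct Node {
    Kind kind;
    long own_cost = 0;    // BasicBlock: cost derived from instruction categories
    long trip_count = 1;  // Loop: statically resolved trip count
    std::vector<std::unique_ptr<Node>> children;
};

// Bottom-up static cost computation: children are costed first, then
// aggregated at the parent level according to the node type.
long compute_cost(const Node& n) {
    switch (n.kind) {
    case Kind::BasicBlock:
        return n.own_cost;
    case Kind::Conditional: {
        // Minimum of the "taken" and "not taken" paths (no profiling info).
        long taken = compute_cost(*n.children.at(0));
        long not_taken = compute_cost(*n.children.at(1));
        return std::min(taken, not_taken);
    }
    case Kind::Loop: {
        // Summation of children costs multiplied by the loop trip count.
        long body = 0;
        for (const auto& c : n.children) body += compute_cost(*c);
        return body * n.trip_count;
    }
    }
    return 0;  // unreachable
}
```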
As mentioned above, there are situations in which the control flow construction of the DAG 401 does not enable all of the loop trip counts to be statically resolved. In these instances, a portion of the code cost may be automatically computed at runtime by generating the code cost computation expression 144 (at compile time) and injecting it into the application code 124.
In this example, the trip count of the inner loop forms an arithmetic sequence, in which the n-th term may be represented according to Equation 8 below:

a_n = a_1 + (n − 1)d    Equation 8

wherein a_1 represents the first term of the sequence and d represents the common difference between consecutive terms.
The total number of iterations of the inner loop may be equal to the sum of the first N terms of the arithmetic sequence and may be represented according to Equation 9 below:
S_n = [n(a_1 + a_n)]/2    Equation 9

S_N = N*(3 + N + 3)/2    Equation 10
Equation 10 is the specialization of Equation 9 for the example case.
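For concreteness, a hypothetical nested loop of the following shape exhibits such an arithmetic sequence of inner trip counts; the loop bounds here are illustrative and differ slightly from the example case of Equation 10.

```cpp
// Hypothetical triangular nested loop: the inner trip count grows by a
// common difference d = 1 per outer iteration, forming the arithmetic
// sequence 3, 4, ..., N + 2.
void triangular(int N, int* out) {
    for (int i = 0; i < N; ++i) {          // outer loop: N iterations
        for (int j = 0; j < i + 3; ++j) {  // inner trip counts: 3, 4, ..., N + 2
            out[i] += j;                   // caller zero-initializes out
        }
    }
}
// By Equation 9, total inner iterations: S_N = N * (3 + (N + 2)) / 2,
// so the injected cost expression needs only N at runtime.
```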
It should be appreciated that, if profiling information about loop execution is available and provides a profiled trip count value, the following approach may be implemented. For loops with dynamic trip counts, the profiled trip count may be used and the cost of the loop may be estimated as if the trip count were static. In this regard, there may be two scenarios. First, if the loop can be determined to be profitable based on the profiled trip count value, the loop may be treated as having a static trip count. Second, if the profiled trip count does not indicate that the loop is profitable for parallelization, the profiling information may be ignored, and the cost estimation and profitability check may be applied using the above-described techniques for loops with dynamic trip counts. One of ordinary skill in the art will appreciate that other methods and techniques may be implemented. In an embodiment, the above-described methods and techniques may be modified to accommodate different profitability needs and/or performance strategies.
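A minimal sketch of this two-scenario decision is shown below, with hypothetical structure and function names; a real implementation would hook into the compiler's internal profile data.

```cpp
// Hypothetical compile-time decision based on a profiled trip count: if the
// profile alone proves profitability, treat the trip count as static;
// otherwise ignore the profile and fall back to the dynamic-trip-count
// cost expression techniques described above.
struct ProfiledLoop {
    long profiled_trip_count;  // from an external profiling run
    long body_cost;            // statically computed cost of one iteration
};

enum class Decision { TreatAsStatic, UseDynamicExpression };

Decision decide(const ProfiledLoop& loop, long n_procs, long overhead) {
    long w = loop.profiled_trip_count * loop.body_cost;  // estimated W
    if (w / n_procs + overhead < w) {
        return Decision::TreatAsStatic;     // scenario 1: profile proves profitability
    }
    return Decision::UseDynamicExpression;  // scenario 2: ignore the profile
}
```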
The system 100 may be incorporated into any desirable computing system.
A digital camera 348 may be coupled to the processors 126. In an exemplary aspect, the digital camera 348 is a charge-coupled device (CCD) camera or a complementary metal-oxide semiconductor (CMOS) camera. A stereo audio coder-decoder (CODEC) 350 may be coupled to the processors 126. Moreover, an audio amplifier 352 may be coupled to the stereo audio CODEC 350. In an exemplary aspect, a first stereo speaker 354 and a second stereo speaker 356 are coupled to the audio amplifier 352. A microphone amplifier 358 may also be coupled to the stereo audio CODEC 350. Additionally, a microphone 360 may be coupled to the microphone amplifier 358. In a particular aspect, a frequency modulation (FM) radio tuner 362 may be coupled to the stereo audio CODEC 350. Also, an FM antenna 364 is coupled to the FM radio tuner 362. Further, stereo headphones 366 may be coupled to the stereo audio CODEC 350.
Certain steps in the processes or process flows described in this specification naturally precede others for the invention to function as described. However, the invention is not limited to the order of the steps described if such order or sequence does not alter the functionality of the invention. That is, it is recognized that some steps may be performed before, after, or in parallel (substantially simultaneously) with other steps without departing from the scope and spirit of the invention. In some instances, certain steps may be omitted or not performed without departing from the invention. Further, words such as “thereafter”, “then”, “next”, etc. are not intended to limit the order of the steps. These words are simply used to guide the reader through the description of the exemplary method.
Additionally, one of ordinary skill in programming is able to write computer code or identify appropriate hardware and/or circuits to implement the disclosed invention without difficulty based on the flow charts and associated description in this specification, for example.
Therefore, disclosure of a particular set of program code instructions or detailed hardware devices is not considered necessary for an adequate understanding of how to make and use the invention. The inventive functionality of the claimed computer implemented processes is explained in more detail in the above description and in conjunction with the Figures which may illustrate various process flows.
In one or more exemplary aspects, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. Storage media may be any available media that may be accessed by a computer. By way of example, and not limitation, such computer-readable media may comprise RAM, ROM, EEPROM, NAND flash, NOR flash, M-RAM, P-RAM, R-RAM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store desired program code in the form of instructions or data structures and that may be accessed by a computer.
Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (“DSL”), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
Disk and disc, as used herein, include compact disc (“CD”), laser disc, optical disc, digital versatile disc (“DVD”), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Alternative embodiments will become apparent to one of ordinary skill in the art to which the invention pertains without departing from its spirit and scope. Therefore, although selected aspects have been illustrated and described in detail, it will be understood that various substitutions and alterations may be made therein without departing from the spirit and scope of the present invention, as defined by the following claims.
This application claims the benefit of the priority of U.S. Provisional Patent Application No. 62/081,465, entitled “Systems, Methods, and Computer Programs for Performing Runtime Auto-Parallelization of Application Code,” filed on Nov. 18, 2014 (Attorney Docket No. 17006.0379U1), which is hereby incorporated by reference in its entirety.