The present disclosure is related to hardware-software co-design.
As technology scaling becomes a challenge, System-on-Chip (SoC) architects are exploring the capabilities of Domain-Specific SoCs (DSSoCs) to effectively balance performance and flexibility. DSSoC architectures are characterized by a heterogeneous collection of general-purpose cores and programmable accelerators tailored to a particular application domain. The uniqueness of DSSoC architectures gives rise to a number of challenges.
First, the design and implementation of hardware accelerators is time-consuming and complex. DSSoCs are characterized by application domains with recurring compute- and/or energy-intensive routines, and an effective DSSoC requires a collection of accelerators built specifically to handle these routines. Hardware implementation and functional verification of custom accelerators while meeting area, timing, and power constraints at the system level remain a significant challenge.
Second, DSSoCs commonly operate in real-time environments where time-constrained applications arrive dynamically. For a fixed collection of heterogeneous accelerators, this requires dynamic and low-overhead scheduling strategies to enable effective runtime management and task partitioning across these accelerators. A common approach in enabling rich scheduling algorithms that maximize processing element (PE) utilization is to model applications as directed acyclic graphs (DAGs). Assuming DAG-based applications, the complexity of managing a large collection of task dependencies and prioritizing execution across a variety of custom and general-purpose PEs makes scheduling a non-trivial problem in DSSoCs.
Third, as with any heterogeneous platform, it is crucial to provide productive toolchains by which application developers can port their applications to DSSoCs. In particular, target applications must be analyzed in terms of their phases of execution, and the portions of each application that are amenable to heterogeneous execution must be mapped as such to the various resources present on a given DSSoC. Providing application developers a rich environment in which they can explore different application partitioning strategies, contextualized by realistic scheduler models and accelerator interfaces, is critical in enabling efficient execution on production hardware.
Finally, in a production DSSoC, effective on-chip communication is crucial to extract maximum performance with minimum latency and energy consumption. Hence, there is a need for an efficient Network-on-Chip (NoC) fabric that is tailored to a given DSSoC's collection of accelerators. Together with the aforementioned challenges, this makes designing and evaluating DSSoC architectures a complex task.
A user-space emulation framework for heterogeneous system-on-chip (SoC) design is provided. Embodiments described herein propose a portable, Linux-based emulation framework to provide an ecosystem for hardware-software co-design of heterogeneous SoCs (e.g., domain-specific SoCs (DSSoCs)) and enable their rapid evaluation during the pre-silicon design phase. This framework holistically targets three key challenges of heterogeneous SoC design: accelerator integration, resource management, and application development. These challenges are addressed via a flexible and lightweight user-space runtime environment that enables easy integration of new accelerators, scheduling heuristics, and user applications, and the utility of each is illustrated through various case studies.
With signal processing (WiFi and RADAR) as the target domain, this framework is used to evaluate the performance of various dynamic workloads on hypothetical heterogeneous SoC hardware configurations composed of mixtures of central processing unit (CPU) cores and Fast Fourier Transform (FFT) accelerators using a Zynq UltraScale+™ MPSoC. The portability of this framework is shown by conducting a similar study on an Odroid platform composed of big.LITTLE ARM clusters. Finally, a prototype compilation toolchain is introduced that enables automatic mapping of unlabeled C code to heterogeneous SoC platforms. Taken together, this environment offers a unique ecosystem to rapidly perform functional verification and obtain performance and utilization estimates that help accelerate convergence towards a final heterogeneous SoC design.
An exemplary embodiment provides an emulation environment for heterogeneous SoC design. The emulation environment includes a workload manager configured to schedule application tasks onto heterogeneous processing elements (PEs) in a heterogeneous SoC based on a scheduling policy and a resource manager configured to simulate a test hardware configuration using the heterogeneous PEs and execute the application tasks scheduled by the workload manager.
Another exemplary embodiment provides a method for developing an application for heterogeneous SoC implementation. The method includes obtaining an application code, converting the application code into a platform-independent hardware representation, and generating an object notation-based representation of the application code for heterogeneous SoC implementation from the platform-independent hardware representation.
Those skilled in the art will appreciate the scope of the present disclosure and realize additional aspects thereof after reading the following detailed description of the preferred embodiments in association with the accompanying drawing figures.
The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element such as a layer, region, or substrate is referred to as being “on” or extending “onto” another element, it can be directly on or extend directly onto the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly on” or extending “directly onto” another element, there are no intervening elements present. Likewise, it will be understood that when an element such as a layer, region, or substrate is referred to as being “over” or extending “over” another element, it can be directly over or extend directly over the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly over” or extending “directly over” another element, there are no intervening elements present. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present.
Relative terms such as “below” or “above” or “upper” or “lower” or “horizontal” or “vertical” may be used herein to describe a relationship of one element, layer, or region to another element, layer, or region as illustrated in the Figures. It will be understood that these terms and those discussed above are intended to encompass different orientations of the device in addition to the orientation depicted in the Figures.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including” when used herein specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The present disclosure proposes an open-source, portable user-space emulation framework that seeks to address the first three challenges of accelerator design, resource management, and application development in the early, pre-silicon stages of heterogeneous SoC (e.g., DSSoC) development. This framework is a lightweight Linux application that is designed to be suitable for emulating heterogeneous SoCs on various commercial off-the-shelf (COTS) computing systems. For the above three challenges, it provides distinct plug-and-play integration points where developers can individually integrate and evaluate their applications, schedulers, and accelerator IPs in a realistic and holistic system before a full virtual platform or platform silicon is made available.
Notably, to enable rapid application integration, the framework also includes a prototype compilation toolchain that allows users to map monolithic, unlabeled C applications to directed acyclic graph (DAG)-based applications as an alternative to requiring hand-crafted, custom integration for each application in a domain. Beyond enabling functional verification for each of these aspects of a heterogeneous SoC separately, this unified environment assists in deriving relative performance estimates among different combinations of applications, scheduling algorithms, and heterogeneous SoC hardware configurations. These estimates are expected to help SoC developers narrow their configuration space prior to performing in-depth, cycle-accurate simulations of a complete system and accelerate convergence to a final heterogeneous SoC design.
Section II introduces the proposed framework and describes the functionality of its key components. The interfaces required to integrate new schedulers, applications, and processing elements (PEs) are also described. Section III presents various use cases of the emulation framework based on real applications from the signal processing domain on COTS platforms. Section IV presents a computer system used for implementing embodiments described herein.
At the start of an emulation, the emulation framework 10 performs an initialization phase in which the application handler 12 initializes a queue containing the required workload and allocates the memory required by the emulation workload in the main memory. In the same phase, the resource manager 16 initializes the target heterogeneous SoC configuration using the real PEs 18 in the underlying SoC. After the initialization phase, the workload manager 14 drives the emulation by dynamically injecting the applications from the workload queue and coordinating with the resource manager 16 to schedule tasks on the idle PEs 18. Before termination, the emulation framework 10 collects the scheduling statistics for all the applications and their tasks. These statistics can later be used to evaluate the performance of the emulated heterogeneous SoC. Communication between different PEs 18 is performed using the shared memory 20 of the platform. As a result, while this framework can assist in hardware, scheduler, and application design, it is currently limited in its ability to handle hypothetical Network-on-Chip (NoC) architectures. The subsequent subsections present details of all the components in the emulation framework 10 and detail the steps required to integrate new features.
A. Application Handler
As an example, a user may wish to execute three instances of range detection in validation mode. Given this request, the emulation framework 10 will parse all available applications, and it will output an error if, at the end of this process, it has not detected range detection as referenced by its AppName. Assuming the emulation framework 10 was able to find and parse the archetypal instance of range detection, it will then instantiate three copies of this base application. Each application instance will have all its variables allocated and initialized as described in the JSON. After initialization, the application will be enqueued into a workload queue and passed to the workload manager 14 to emulate application arrival and scheduling.
To integrate new applications, a developer has three choices. First, they can build a DAG-based application entirely from scratch, compile it into a shared object of kernels, and link the kernels together with a hand-crafted JSON-based DAG representation. Second, they can leverage the existing library of kernels present in other applications and define a new application simply by linking those kernels together in a novel way. In this manner, many application domains can be rapidly implemented through piecemeal combinations of common kernels, simply by defining how they are linked together. Third, a developer can utilize an automated workflow, provided as part of the emulation framework 10, that allows for automatic, if less optimized, conversion of monolithic C code into DAG-based applications. Further details about the functionality and capabilities of this third option are presented in Section II-D.
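For illustration only, a hand-crafted DAG description might resemble the following sketch. The key names (AppName, variables, DAG, predecessors, shared_object, run_func) and all values are hypothetical stand-ins consistent with the JSON-based representation described herein, not the framework's verbatim schema.

```json
{
  "AppName": "range_detection",
  "variables": {
    "samples": { "bytes": 1024, "init_from": "input.bin" }
  },
  "DAG": {
    "FFT_0": {
      "predecessors": [],
      "shared_object": "kernels.so",
      "run_func": "fft_kernel"
    },
    "MULT_0": {
      "predecessors": ["FFT_0"],
      "shared_object": "kernels.so",
      "run_func": "complex_multiply"
    }
  }
}
```

Under the second integration choice above, a new application could be expressed entirely as a file of this form that references kernels already compiled into existing shared objects.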
B. Workload Manager
The workload manager 14 drives the emulation in the emulation framework 10. It is responsible for tracking the emulation time, injecting applications, implementing scheduling policies, and coordinating with the resource managers 16 to execute the tasks on the PEs 18. The workload manager 14 uses the workload queue from the application handler 12 and the task scheduling algorithm from the user as its inputs. At run-time, the user is given the option to select one of the available scheduling policies from the library or to use a custom scheduling algorithm. The default scheduling library is composed of minimum execution time (MET), first ready-first start (FRFS), earliest finish time (EFT), and random (RANDOM).
To utilize a user-defined scheduling policy, an additional policy needs to be defined in scheduler.cpp and a dispatch call needs to be added in the same file's performScheduling function. This new policy must accept parameters such as the ready queue of tasks and handles for each of the “resource handler” objects. Each task consists of a DAG node data structure with all the information necessary for scheduling, dispatch, and measurement of a single node's performance throughout the emulation framework 10. Each resource handler object is associated with a unique PE 18 and is composed of fields that track PE 18 availability, type, and ID along with its workload and synchronization lock. The PE availability field is used to communicate resource state between the workload manager 14 and the resource manager 16. A PE's availability status can be idle, run, or complete. A thread monitoring or modifying the status field should acquire the PE's synchronization lock, read or write the status field, and release the lock. A new scheduling algorithm should begin by checking the availability of all the PEs 18 by querying whether their status fields indicate they are idle. Next, the algorithm performs the task-to-PE mapping on the ready tasks and transfers them over to the resource managers 16 of their mapped PEs 18 via the resource handlers. Then, the algorithm commands the resource manager 16 to start executing a task by modifying the PE state to run (block 412). The resource manager 16 notifies the workload manager 14 of task completion by modifying the status to complete (block 414). After this notification, the workload manager 14 appends the outstanding tasks to the ready list and updates the PE status to idle (block 416).
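By way of illustration, a minimal user-defined policy might look like the following C++ sketch, which maps each ready task to the first idle PE. All type and member names (TaskNode, ResourceHandle, status, lock, and so on) are hypothetical stand-ins for the framework's actual DAG node and resource handler structures.

```cpp
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical stand-ins for the framework's DAG node and resource handler types.
struct TaskNode { int task_id; int preferred_pe_type; };
enum class PEStatus { Idle, Run, Complete };
struct ResourceHandle {
  int pe_id;
  int pe_type;
  PEStatus status;            // availability communicated between managers
  std::mutex lock;            // synchronization lock guarding the status field
  std::queue<TaskNode> work;  // tasks handed off to this PE's resource manager
};

// Minimal user-defined policy: map each ready task to the first idle PE.
// A real policy would be dispatched from performScheduling in scheduler.cpp.
void myPolicy(std::queue<TaskNode>& ready, std::vector<ResourceHandle*>& pes) {
  while (!ready.empty()) {
    bool dispatched = false;
    for (ResourceHandle* pe : pes) {
      std::lock_guard<std::mutex> guard(pe->lock);  // acquire before touching status
      if (pe->status == PEStatus::Idle) {
        pe->work.push(ready.front());  // transfer task via the resource handler
        pe->status = PEStatus::Run;    // command the resource manager to execute
        ready.pop();
        dispatched = true;
        break;
      }
    }
    if (!dispatched) break;  // no idle PE; leave remaining tasks for the next pass
  }
}
```

A richer policy such as MET or EFT would replace the first-idle search with a cost comparison across PE types, but the locking and status-transition protocol would remain the same.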
C. Resource Manager
At the start of emulation, the emulation framework 10 reads the number and types of PEs 18 from the input configuration file and initializes the dedicated threads of the resource manager 16 for each PE 18. These threads are responsible for controlling the operations on their assigned PEs 18. These operations involve executing the assigned task, managing the data transfer between the main memory and the custom accelerator (if required), and coordinating the PE availability status with the workload manager 14. If the input PE type is CPU, then the emulation framework 10 assigns the affinity of its resource manager 16 thread to one of the unused CPU cores in the underlying SoC. For all other PE types, resource manager 16 thread assignment begins with the unused CPU cores, and the threads are then evenly distributed among all the CPU cores in the resource pool. To derive relative performance estimates, it is recommended to instantiate a test configuration such that each resource manager 16 thread is assigned to a separate CPU core to reduce the impact of context switching among the threads.
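A minimal sketch of how a resource manager thread might pin itself to a particular core on a Linux platform follows; the helper name and core numbering are assumptions, and a real deployment would derive the PE-to-core mapping from the input configuration file.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling resource-manager thread to one CPU core.
// Linux/GNU-specific (pthread_setaffinity_np); core_id numbering is assumed.
bool pinThreadToCore(int core_id) {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(core_id, &mask);
  int rc = pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
  if (rc != 0) {
    std::fprintf(stderr, "affinity failed for core %d (rc=%d)\n", core_id, rc);
    return false;
  }
  return true;
}
```

Pinning each resource manager thread in this way keeps OS-level migration and context switching from contaminating the relative performance estimates discussed above.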
D. Automatic Application Conversion
As an alternative to requiring hand-crafted DAG-based applications, a basic toolchain is also provided that allows for automatic conversion of monolithic, unlabeled C applications to DAG-based applications through a combination of dynamic tracing-based kernel/node detection and LLVM code outlining.
The application code is first instrumented for tracing. With the code instrumented, a tracing executable is compiled that dumps a runtime trace of the application behavior to disk (block 602). This trace is then analyzed by the TraceAtlas toolchain, which identifies what sections of the code should be labeled as “kernels” or “non-kernels,” where a “kernel” is a set of highly correlated IR-level blocks from the original source code that execute frequently in the base program (block 604). In a broad sense, kernels are analogous to the “hot” sections of the source program. With this information, the original file can be partitioned into alternating groups of “cold”/“non-kernel” code and “hot”/“kernel” code.
This information is then passed through an in-house tool, built on LLVM's CodeExtractor module, that uses the information about these code groups to automatically refactor the LLVM IR into a sequence of function calls, where each function call invokes the proper group of blocks necessary to recreate the original application behavior. Additionally, this in-house tool analyzes the memory requirements of the original application by identifying both static memory allocation, in terms of variable declarations, and dynamic memory allocation, by attempting to statically determine the parameters passed into initial malloc/calloc calls. With this information, along with the source code outlined via LLVM's CodeExtractor (block 606), embodiments are able to automatically generate a JSON-based DAG that is compatible with the runtime framework presented here (block 608).
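Conceptually, the outlining step refactors a monolithic body into callable “node” functions, as in the hand-written illustration below; this is not actual CodeExtractor output, and the function names are invented for the sketch.

```cpp
#include <vector>

// After outlining, each "hot"/"cold" group becomes its own function,
// invocable as a DAG node. Hand-written illustration only.
static void nonkernel_0(std::vector<float>& buf) {  // outlined "cold" group
  buf.assign(128, 0.5f);                            // setup-style code
}

static void kernel_0(std::vector<float>& buf) {     // outlined "hot" group
  for (auto& x : buf) x = x * x;                    // frequently executed blocks
}

int main() {
  std::vector<float> buf;
  nonkernel_0(buf);  // cold group runs first
  kernel_0(buf);     // hot group, now schedulable as a DAG node
  return 0;
}
```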
Thanks to the flexibility of having each node abstracted as a function call, this JSON-based DAG can improve an application's execution: if a particular kernel can be recognized, that node's run_func can be replaced with an optimized invocation that has the same function signature. For example, recognizing a naive for loop-based discrete Fourier transform (DFT) would allow this compilation process to substitute in a call to an FFT library or add support for an FFT accelerator. By compiling the modified IR source into a shared object, it can be used along with the JSON-based DAG to functionally recreate the user-provided application in the runtime framework. The end result is unlikely to be as optimized and parallelized at this stage as a hand-crafted DAG, but it provides a quick path for porting functionally correct code into the runtime presented.
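As a hedged illustration of such a substitution, the sketch below contrasts a naive loop-based DFT with a drop-in FFTW invocation sharing the same signature; the function names and complex data layout are assumptions for the sketch rather than the toolchain's actual output.

```cpp
#include <complex>
#include <cmath>
#include <fftw3.h>  // link with -lfftw3

using cplx = std::complex<double>;

// Naive O(n^2) DFT, as the kernel might appear in the original monolithic source.
void dft_node(cplx* in, cplx* out, int n) {
  for (int k = 0; k < n; ++k) {
    out[k] = 0;
    for (int j = 0; j < n; ++j)
      out[k] += in[j] * std::polar(1.0, -2.0 * M_PI * k * j / n);
  }
}

// Semantically equivalent replacement with the same signature, suitable for
// swapping in as the node's run_func once the kernel is recognized.
// std::complex<double> is layout-compatible with fftw_complex per FFTW's docs.
void fft_node(cplx* in, cplx* out, int n) {
  fftw_plan p = fftw_plan_dft_1d(n,
      reinterpret_cast<fftw_complex*>(in),
      reinterpret_cast<fftw_complex*>(out),
      FFTW_FORWARD, FFTW_ESTIMATE);
  fftw_execute(p);
  fftw_destroy_plan(p);
}
```

Because both functions share a signature, the runtime can dispatch either one from the same DAG node without any change to the application's structure.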
This section presents four case studies to demonstrate the usability and portability of the proposed emulation framework 10 (e.g., emulation environment). In the first study, the validation mode of the framework is used to identify a suitable heterogeneous SoC configuration to meet performance requirements. In the second study, the performance mode is used to narrow down the scheduling policy for a given application domain. The portability of the framework is demonstrated by conducting a similar study on a different COTS platform in the third case study. As a fourth case study, the compilation toolchain that maps unlabeled, monolithic code to a DSSoC is illustrated. This section begins with a brief description of the hardware platforms and signal processing applications used for the studies.
A. Hardware Platforms and Applications
ZCU102 and Odroid XU3 platforms are used in the case studies. ZCU102 is a general-purpose evaluation kit built on top of the Zynq UltraScale+™ MPSoC. This MPSoC combines general-purpose processing units (a quad-core ARM Cortex-A53 and a dual-core Cortex-R5) and programmable fabric on a single chip. A resource pool composed of two FFT accelerators on the programmable fabric and three general-purpose A53 CPU cores is created to instantiate different heterogeneous SoC configurations. The fourth A53 core is used as an overlay processor to run the workload manager 14 and the application handler 12. On this platform, direct memory access (DMA) blocks are used to facilitate the transfer of data between memory and hardware accelerators through AXI4-Stream, a streaming protocol.
Odroid XU3 is a single-board computer that features an Exynos 5422 SoC. The SoC is based on the ARM heterogeneous big.LITTLE architecture, in which the LITTLE cores (Cortex-A7) are highly energy-efficient and the big cores (Cortex-A15) are performance-oriented. The Cortex-A7 and Cortex-A15 in this SoC are quad-core 32-bit multi-processor cores implementing the ARMv7-A architecture. One of the LITTLE cores is used as an overlay processor to run the workload manager 14 and the application handler 12. The remaining four big cores and three LITTLE cores form the resource pool used to instantiate different heterogeneous SoC configurations.
B. Case Study 1: Validation Mode
The primary use of the validation mode is to functionally verify the integration of an application task-graph, scheduling algorithm, and accelerator in the emulation framework 10. The validation mode is also used to obtain an estimate of the workload execution time and PE 18 utilization on different SoC configurations. The estimates obtained with the emulation framework 10 are not designed to be cycle-accurate with respect to the real silicon chip. Instead, they are designed to assist hardware and software designers in obtaining the relative performance and PE 18 utilization of a given workload on different target SoC configurations.
However, increasing the number of CPU cores yields a greater improvement in execution time than increasing the number of FFT accelerators; that is, the execution time improves more in moving from the 1Core+1FFT configuration to the 2Core+1FFT configuration than in moving to the 1Core+2FFT configuration. This behavior is observed because the input sample count to the FFT accelerator is only 128. On the ZCU102 platform, an FFT of this size has a faster turn-around time on a CPU core than on the FFT accelerator. The overhead associated with data transfer between the main memory and the programmable fabric in the ZCU102 platform limits the usability of the programmable fabric for processing such a small data set.
A negligible difference is observed between the execution times of the 2Core+1FFT and 2Core+2FFT configurations. This is because, in the 2Core+2FFT configuration, the resource manager 16 threads for the FFT accelerators share a CPU core. As a result, they cyclically preempt each other. The overhead involved in OS-level thread preemption and scheduling ends up dominating the benefit of using two FFT accelerators in this configuration. For the remaining configurations in the figures, each resource manager 16 thread executes on a dedicated CPU core. This ensures that execution time improves as PEs 18 are added to the heterogeneous SoC configuration.
PE resource utilization is calculated as the ratio between the usage time of a PE 18 and the total execution time of the workload. The utilization of the CPU cores is significantly higher than that of the FFT accelerators for the heterogeneous SoC. The maximum CPU core utilization observed is 80%, for the 1Core+0FFT configuration. Because embodiments execute the scheduling algorithm on the completion of each task, significant scheduling overhead is incurred. However, some embodiments incorporate task reservation queues on each PE 18 to reduce the impact of this scheduling overhead.
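Expressed as a formula, with symbols introduced here for clarity, where $t_{\text{busy}}(\text{PE})$ denotes the cumulative time a PE 18 spends executing tasks and $t_{\text{workload}}$ denotes the total workload execution time:

$$U_{\text{PE}} = \frac{t_{\text{busy}}(\text{PE})}{t_{\text{workload}}}$$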
C. Case Study 2: Performance Mode
This case study compares the performance of different scheduling algorithms (FRFS, MET, and EFT) on a DSSoC configuration composed of three cores and two FFT accelerators. The emulation framework 10 is operated in the performance mode. This mode is designed to emulate the dynamic injection of applications on a target heterogeneous SoC. In the performance mode, the user needs to provide the frequency and probability of injection for each application, along with the timeframe during which applications are injected. For the evaluation, it is assumed that applications are injected periodically with a probability of one in a test timeframe of 100 milliseconds. To create a new workload trace, the injection period is varied for each application to alter the average injection rate. Table I presents the standalone execution time for each application on a 3Core+2FFT SoC configuration. Table II presents the instance count for a given application in each workload trace. Compared to Pulse Doppler, higher injection frequencies are chosen for the range detection and WiFi applications because of their shorter execution times and smaller DAGs.
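As an illustrative model only (assumed, not the framework's actual code), the instance counts in such a trace follow directly from each application's injection period and the timeframe; the application names and periods below are hypothetical.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch: with periodic injection at probability one in a
// 100 ms timeframe, the instance count is timeframe / period for each app.
int main() {
  struct App { std::string name; double period_ms; };
  std::vector<App> apps = {
      {"range_detection", 5.0}, {"wifi", 10.0}, {"pulse_doppler", 25.0}};
  const double timeframe_ms = 100.0;
  for (const auto& a : apps) {
    int instances = static_cast<int>(timeframe_ms / a.period_ms);
    std::printf("%s: inject every %.1f ms -> %d instances\n",
                a.name.c_str(), a.period_ms, instances);
  }
  return 0;
}
```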
The framework is able to expose the limitations of underlying design decisions related to the SoC configuration and scheduling policies for a given set of applications. Traditionally, researchers use discrete event-based simulation tools, such as DS3 and SimGrid, to develop and evaluate new scheduling algorithms. These simulators rely on statistical profiling information to model the performance of general-purpose cores and hardware accelerators. As a result, they are inadequate for capturing scheduling overhead and for performing functional validation of the system and IP, as they are designed to operate without real applications and hardware. Cycle-accurate simulators, such as gem5 and PTLSim, address the drawbacks of discrete event simulators by performing cycle-by-cycle execution of real applications and scheduling algorithms on the simulated target system or IP. However, these simulators are slow and primarily used to validate individual IP designs or a few specific test cases for full-system validation. The turnaround time of the emulation framework 10 is substantially lower than that of the cycle-accurate simulators, and its capability to capture the impact of scheduling overheads on the total execution time provides better estimates during design space exploration than the discrete event simulators.
D. Case Study 3: Performance Analysis on Odroid XU3
A linear correlation between the workload execution time and the job injection rate is observed. The configuration composed of three big cores and two LITTLE cores, i.e., 3BIG+2LTL, has the best execution time across different job injection rates. The configurations 3BIG+1LTL, 4BIG+1LTL, and 2BIG+3LTL perform comparably to the best-performing configuration, with less than a 3% impact on performance. Interestingly, the workload execution time on the 4BIG+3LTL and 4BIG+2LTL configurations is higher than on the 4BIG+1LTL configuration. This is because, in the framework, the scheduling complexity of the FRFS algorithm is proportional to the number of PEs 18 in the emulated SoC. As the PE count in the emulated SoC increases, the scheduling overhead becomes noticeable relative to the task execution time. Furthermore, the lower operating frequency of the overlay processor (a LITTLE core) increases the scheduling overhead.
E. Case Study 4: Automatic Application Conversion
The preceding case studies have primarily focused on exploring performance estimates for different heterogeneous SoC configurations and workload scenarios while holding the applications used for evaluation fixed. However, demonstrating a meaningful path by which application developers can map novel applications to a fixed heterogeneous SoC configuration is a similarly critical part of the overall heterogeneous SoC design process. In this case study, the capabilities of the dynamic tracing-based compilation toolchain are explored through automatic mapping of a monolithic range detection C code to the emulation environment. The ZCU102 platform is targeted with a configuration composed of three cores and one FFT accelerator.
As described in Section II-D, the toolchain works by using TraceAtlas to dynamically trace the baseline application and extract kernels of interest via analysis of this runtime trace. In range detection, six kernels are currently detected: three consist of heavy file I/O, two consist of the two FFTs, and one consists of the IFFT.
For this particular application, the two FFTs and one IFFT were implemented as simple for loop-based DFTs and an inverse DFT (IDFT). As such, to explore the inherent ability to optimize by selecting semantically equivalent but highly optimized run_func invocations, an additional shared object library is compiled that contains two optimized implementations of the DFT kernel: one that uses FFTW compiled for ARM to invoke a highly optimized FFT, and one that targets the FFT accelerator present on the ZCU102's programmable logic to test the framework's ability to transparently add support for accelerators. Through hash-based kernel recognition, the platform entries in the DAG JSON were then automatically redirected to this shared object through use of the shared object key, as demonstrated for the FFT_0 node.
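A plausible, purely hypothetical rendering of such a redirected node is sketched below; the platforms key, library names, and function names are illustrative assumptions rather than the actual generated JSON.

```json
"FFT_0": {
  "predecessors": [],
  "platforms": [
    { "name": "cpu",       "shared_object": "libdft_opt.so", "run_func": "fft_node" },
    { "name": "fft_accel", "shared_object": "libdft_opt.so", "run_func": "fft_accel_node" }
  ]
}
```

With both platform entries present, the scheduler remains free to dispatch the node to a CPU core or to the FFT accelerator at run-time, with no change to the application source.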
While these results do rest on the fairly strict assumption that a kernel can be recognized operationally, in an automatic compilation process with no human input, they present a promising pathway forward in exploring a generalizable compilation flow for heterogeneous SoCs. Despite this being a first-pass implementation of such a compilation flow, benefits are observed through new optimization opportunities on the CPU side and through the ability to add automatic support for heterogeneous accelerators without any user intervention or compiler directives. Some embodiments enable further benefits, such as support for automatic parallelization of independent kernels via analysis of their runtime memory access patterns and a more generalizable approach for recognizing kernels and pairing them with compatible optimized invocations.
The exemplary computer system 1300 in this embodiment includes a processing device 1302 or processor, a system memory 1304, and a system bus 1306. The system memory 1304 may include non-volatile memory 1308 and volatile memory 1310. The non-volatile memory 1308 may include read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The volatile memory 1310 generally includes random-access memory (RAM) (e.g., dynamic random-access memory (DRAM), such as synchronous DRAM (SDRAM)). A basic input/output system (BIOS) 1312 may be stored in the non-volatile memory 1308 and can include the basic routines that help to transfer information between elements within the computer system 1300.
The system bus 1306 provides an interface for system components including, but not limited to, the system memory 1304 and the processing device 1302. The system bus 1306 may be any of several types of bus structures that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and/or a local bus using any of a variety of commercially available bus architectures.
The processing device 1302 represents one or more commercially available or proprietary general-purpose processing devices, such as a microprocessor, CPU, or the like. More particularly, the processing device 1302 may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or other processors implementing a combination of instruction sets. The processing device 1302 is configured to execute processing logic instructions for performing the operations and steps discussed herein.
In this regard, the various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with the processing device 1302, which may be a microprocessor, a field-programmable gate array (FPGA), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), another programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. Furthermore, the processing device 1302 may be a microprocessor, or may be any conventional processor, controller, microcontroller, or state machine. The processing device 1302 may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
The computer system 1300 may further include or be coupled to a non-transitory computer-readable storage medium, such as a storage device 1314, which may represent an internal or external hard disk drive (HDD), flash memory, or the like. The storage device 1314 and other drives associated with computer-readable media and computer-usable media may provide non-volatile storage of data, data structures, computer-executable instructions, and the like. Although the description of computer-readable media above refers to an HDD, it should be appreciated that other types of media that are readable by a computer, such as optical disks, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the operating environment, and, further, that any such media may contain computer-executable instructions for performing novel methods of the disclosed embodiments.
An operating system 1316 and any number of program modules 1318 or other applications can be stored in the volatile memory 1310, wherein the program modules 1318 represent a wide array of computer-executable instructions corresponding to programs, applications, functions, and the like that may implement the functionality described herein in whole or in part, such as through instructions 1320 on the processing device 1302. The program modules 1318 may also reside on the storage mechanism provided by the storage device 1314. As such, all or a portion of the functionality described herein may be implemented as a computer program product stored on a transitory or non-transitory computer-usable or computer-readable storage medium, such as the storage device 1314, volatile memory 1310, non-volatile memory 1308, instructions 1320, and the like. The computer program product includes complex programming instructions, such as complex computer-readable program code, to cause the processing device 1302 to carry out the steps necessary to implement the functions described herein.
An operator, such as the user, may also be able to enter one or more configuration commands to the computer system 1300 through a keyboard, a pointing device such as a mouse, or a touch-sensitive surface, such as the display device, via an input device interface 1322 or remotely through a web interface, terminal program, or the like via a communication interface 1324. The communication interface 1324 may be wired or wireless and facilitate communications with any number of devices via a communications network in a direct or indirect fashion. An output device, such as a display device, can be coupled to the system bus 1306 and driven by a video port 1326. Additional inputs and outputs to the computer system 1300 may be provided through the system bus 1306 as appropriate to implement embodiments described herein.
The operational steps described in any of the exemplary embodiments herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary embodiments may be combined.
Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein and the claims that follow.
This application claims the benefit of provisional patent application Ser. No. 63/104,272, filed Oct. 22, 2020, the disclosure of which is hereby incorporated herein by reference in its entirety.
This invention was made with government support under FA8650-18-2-7860 awarded by the Defense Advanced Research Projects Agency. The government has certain rights in the invention.
This application is a national stage entry of International Patent Application No. PCT/US2021/056290, filed Oct. 22, 2021.