This disclosure relates generally to processing of a machine learning model, and, more particularly, to methods and apparatus to process a machine learning model in a multi-process web browser environment.
Nowadays, there is momentum in the computing industry to deploy machine learning (ML) workloads, especially deep learning (DL) models, to end-user edge devices instead of server devices. The advantages of performing computations on edge devices include cost savings, privacy protection, and real-time performance. Machine learning workloads have more recently been provided to end-user edge devices in web browser environment(s). Hardware developers are developing hardware (e.g., central processing units (CPUs), graphics processing units (GPUs), vector processing units (VPUs), etc.) and/or software (e.g., math kernel library deep neural network (MKL-DNN), compute library for deep neural networks (clDNN), etc.) optimizations to accelerate the DL computation at the edge device which, in some examples, involves offloading computations from a CPU to a GPU or other circuitry. However, web browser based environments make utilization of such optimizations difficult.
The FIGS. are not to scale. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.
When a user causes a web browser of a computing system to navigate to a web site, the web browser downloads data including, for example, HyperText Markup Language (HTML) documents, cascading style sheet (CSS) documents, JavaScript files, etc. from a web server, executes the JavaScript code, and renders a user interface according to the HTML and/or CSS. However, security risks exist such that the web browser may be compromised as a result of executing instructions from a malicious web site. To address this security challenge, modern web browsers (e.g., Google Chrome, Microsoft Edge, Mozilla Firefox, etc.) usually utilize multiple processes within a sandboxing architecture. As such, the web browser typically has two types of processes: unprivileged processes and privileged processes.
An unprivileged process implements a rendering engine and/or a JavaScript engine in a sandboxed environment. As such, the unprivileged process is only allowed to access the CPU to execute instructions, but is not allowed access to one or more of a file system, a display, network and/or devices attached to the computing system.
In contrast, a privileged process is allowed access to system resources, such as a graphics processing unit (GPU). To gain access to such system resources, the unprivileged process communicates with the privileged process using an inter-process-communication (IPC) protocol.
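To make the boundary concrete, the following is a minimal sketch of how a sandboxed process might route a GPU request through a privileged counterpart. The message shape and function names are assumptions for illustration only; real browsers use dedicated binary IPC protocols rather than in-process function calls.

```javascript
// Privileged side: the only place with (simulated) GPU access.
function privilegedHandler(message) {
  if (message.type !== 'gpu-execute') {
    return { ok: false, error: 'unsupported request' };
  }
  // Stand-in for handing work to the GPU driver.
  return { ok: true, result: message.payload.map((v) => v * 2) };
}

// Unprivileged side: everything goes through the IPC channel.
function requestGpuWork(payload, sendOverIpc) {
  return sendOverIpc({ type: 'gpu-execute', payload });
}
```

Because the unprivileged side never touches the GPU directly, the privileged handler remains the single point at which requests can be rejected.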
In the illustrated example of
In existing approaches, when displaying a web page, an unprivileged process parses HTML and/or CSS files to result in display of the web page. When the web page includes and/or references a script file (e.g., a JavaScript file), the script file is parsed and executed by the unprivileged process. If the script file includes the use of a machine learning model, the unprivileged process loads the machine learning model (e.g., from a network location), and constructs a computation graph representation of the machine learning model. In some cases, the computation graph is prepared for execution by a central processing unit (CPU).
To prepare an operation node for execution, the framework causes CPU binary instructions to be generated (e.g., compiled) and/or identified. For example, if the machine learning model were to utilize a TensorFlow.js framework (which uses JavaScript), a JavaScript engine may perform just-in-time (JIT) compilation to generate a CPU binary instruction. Alternatively, if the machine learning model were to utilize a WebAssembly (WASM) framework, the JavaScript engine directly generates the CPU binary. The CPU hardware is then directed to execute the generated CPU binary and returns the result to the unprivileged process. The iteration of the computation graph continues until the output tensor is computed.
In some existing approaches, the computation graph may be executed by a graphics processing unit (GPU). For example, the TensorFlow.js framework utilizes a web graphics library (WebGL) application programming interface (API) to prepare instructions for execution by a GPU. Likewise, a WebDNN framework uses a web graphics processing unit (WebGPU) API. When an unprivileged process identifies an operation node for execution, the unprivileged process loads the GPU source code implementation of that operation and calls the corresponding API to execute the GPU shader source at the GPU. As this is done from an unprivileged process (e.g., a process without direct access to the GPU), the unprivileged process communicates the request to the privileged process. The request is communicated between the unprivileged process and the privileged process using an inter-process communication (IPC) protocol.
In existing systems, as the unprivileged process is not trusted by the privileged process, the request from the unprivileged process is validated. The privileged process validates the request and any provided parameters (e.g., the GPU shader source code). If validation succeeds, the GPU shader source code is provided to the GPU driver for execution by the GPU. After the GPU completes the execution of the GPU shader source code, the result is provided to the privileged process, which then communicates the result back to the unprivileged process. This process is iterated until the output tensor is computed.
Thus, in the context of
Moreover, the CPU execution of existing systems is not optimized. As JavaScript and WebAssembly are designed for general mathematics computation and cross-CPU architecture, the JavaScript engine cannot generate CPU instructions specifically optimized for tensor operations (e.g., Intel advanced vector extensions (AVX) instructions, vector neural network instructions (VNNI), etc.).
Likewise, the GPU execution of existing systems is not optimized. For example, existing frameworks (e.g., the WebGL framework, WebGPU shader language, etc.) are designed to be cross-GPU architecture compliant, and the resultant tensor operations are not implemented to take advantage of hardware specific features such as, for example, Intel Open Computing Language (OpenCL) extensions.
Furthermore, in existing systems, the CPU and GPU executions are slow to start. Before achieving a first result, the CPU execution encounters compilation and code-generation overhead. The start of GPU execution is even slower, as such execution involves the overhead of transferring data over an IPC channel. Further, compilation of the GPU shader source code consumes compute time as well.
Example approaches disclosed herein utilize a computation graph CPU interpreter with optimized CPU operation binary code within the unprivileged process of a multi-process web browser. Example approaches disclosed herein also utilize a computation graph GPU compilation framework with optimized GPU operation source code implementation for a multi-process web browser. Example approaches disclosed herein also utilize a computation graph executor to distribute the execution of a computation graph to a CPU interpreter or a GPU compilation orchestrator according to graph execution profiling.
Such approaches enable the use of hardware-specific instructions, such as an AVX instruction and a VNNI instruction for CPU execution, and the use of OpenCL extensions for GPU execution. Moreover, example approaches disclosed herein enable a fast start and high sustained speed execution experience for deep learning workloads in web browser environments.
As noted above, a web browser typically has two types of instruction executors: unprivileged instruction executors and privileged instruction executors. The unprivileged instruction executors commonly implement components (e.g., a rendering engine, a JavaScript engine, etc.) in a sandboxed environment. As such, the unprivileged instruction executor is only allowed to access the CPU to execute instructions, but is not allowed access to one or more of a file system, a display, network and/or devices attached to the computing system.
The unprivileged instruction executor 212 of the illustrated example of
In contrast to the unprivileged instruction executor 212, the privileged instruction executor 214 is allowed access to system resources, such as a graphics processing unit (GPU). To gain access to such system resources, the unprivileged instruction executor 212 communicates with the privileged instruction executor 214 using the inter-process-communication (IPC) channel 225.
The example privileged instruction executor 214 of the illustrated example of
The example inter-process communication (IPC) channel 225 of the illustrated example of
The example GPU driver 227 of the illustrated example of
The example GPU instruction database 229 of the illustrated example of
The example CPU 233 of the illustrated example of
The example GPU 237 of the illustrated example of
As noted above, GPUs execute instruction packages commonly referred to as kernels, compute kernels, and/or shaders. Typically, the term shader is used when a kernel is used for graphics-related tasks such as, for example, DirectX, Open Graphics Library (OpenGL) tasks, pixel shader/shading tasks, vertex shader/shading tasks, etc. The term kernel is used for general purpose computational tasks such as, for example, Open Computing Language (OpenCL) tasks, C for Media tasks, etc. While example approaches disclosed herein use the term kernel, such approaches are equally well suited to be used on shaders. In examples disclosed herein, such kernels roughly correspond to a compiled version of a computation graph. As used herein, a GPU kernel refers to a kernel in binary format.
The example script engine 310 of the illustrated example of
The example graph executor 320 of the illustrated example of
The example CPU interpreter 330 of the illustrated example of
The example optimized CPU code data store 335 of the illustrated example of
The example graph profiler 340 of the illustrated example of
In examples disclosed herein, a computation graph is considered to be executed frequently when it has been executed more than a threshold number of times within a previous threshold time period (e.g., more than twice in the last minute). However, any other factors may be used to determine whether the computation graph is executed frequently and/or, more generally, whether the computation graph should be compiled for future execution by the GPU 237. For example, such factors may include the size of the computation graph (which may have an impact on the amount of resources used to compile the computation graph), the origin of the computation graph (e.g., whether the computation graph originates from a frequently accessed network resource and/or website), and the types of operations included in the computation graph (which may indicate whether compilation is expected to be successful).
The example GPU compiler interface 350 of the illustrated example of
The example IPC client 360 of the illustrated example of
The example IPC server 410 of the illustrated example of
The example GPU compilation orchestrator 420 of the illustrated example of
The example GPU compilation orchestrator 420 sends the source code for compilation into a GPU-specific kernel (e.g., binary code) to the GPU driver for compilation. Upon completion of the compilation, the example GPU compilation orchestrator 420 provides an indication of the completion of the compilation to the graph executor 320 via the IPC server 410.
The example request validator 430 of the illustrated example of
The example optimized GPU code data store 440 of the illustrated example of
While an example manner of implementing the web browser 210 of
A flowchart representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the unprivileged instruction executor 212 of
As mentioned above, the example processes of
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. 
Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
In some examples, it is expected that a script may be executed multiple times. As a result, a prior execution of the script may have resulted in the script being compiled for future execution at the GPU (instead of using interpreted execution at the CPU). In such an example, future execution of the computation graph included in the script may be more efficient if executed using the compilation mode. However, if the topology of the computation graph included in the script has changed, such execution using the compilation mode (where such compilation was performed using a different version of the computation graph) may provide an incorrect result. The example graph executor 320 determines whether the topology of the computation graph has changed. (Block 508). In examples disclosed herein, the example graph executor 320 determines whether the topology of the computation graph has changed based on a hash of the computation graph as compared to a prior hash of the computation graph. However, any other approach to determining whether the computation graph has changed may additionally or alternatively be used. If the example graph executor 320 determines that the topology has changed, the example GPU compiler interface 350 clears any prior flags and/or settings indicating that the computation graph is to be executed using the compilation mode. (Block 510).
The graph executor 320 selects a mode of operation for the computation graph. (Block 515). In examples disclosed herein, the example graph executor 320 selects between (1) an interpretation mode, where the instructions of the computation graph are interpreted for execution by the CPU 233, and (2) a compilation mode, where a compiled version of the computation graph is executed by the GPU 237. In examples disclosed herein, the graph executor 320 selects the mode of operation based on whether the computation graph has previously been compiled for execution by the GPU 237. As noted below, such instructions may be compiled, and a flag may be set indicating the mode of operation to be the compilation mode, in response to that computation graph being frequently executed.
In some examples, other factors may be considered for determining the mode of operation. For example, when a topology of the computation graph is changed at runtime (or at any other time since a prior compilation of the computation graph), the example graph executor 320 may detect such a change and set the mode of operation for that computation graph to the interpretation mode (e.g., in blocks 508, 510). In some examples, the change in the topology of the computation graph may be detected by, for example, comparing a hash of the computation graph to a prior hash of the computation graph (e.g., a hash stored in connection with the compiled version of the computation graph). In examples disclosed herein, if the graph executor 320 is not aware of a compiled version of the computation graph having been previously created, the graph executor 320 defaults to the interpretation mode.
If the example graph executor 320 determines that the interpretation mode is to be used (e.g., Block 515 returns a result of INTERPRETATION MODE), the graph executor 320 identifies a node (e.g., an operation node) of the computation graph that is ready for execution. (Block 520). The example CPU interpreter 330 performs a lookup of the corresponding optimized CPU code in the optimized CPU code data store 335. (Block 522). In examples disclosed herein, the lookup in the optimized CPU code data store is based on the CPU hardware (e.g., the CPU 233) that will perform the execution. As a result, the optimized CPU code does not need to be platform agnostic and can, instead, utilize platform-specific instructions such as, for example, Intel advanced vector extensions (AVX) instructions, vector neural network instructions (VNNI), etc. The example CPU interpreter 330 provides the optimized CPU code to the CPU 233 for execution. (Block 524). The example CPU interpreter 330 accesses a result of the CPU execution. (Block 526). The result is provided to the graph executor 320, which determines whether the execution of the computation graph is complete. (Block 530). If the execution of the computation graph is not complete, control proceeds to block 520 where blocks 520 through 530 are repeated until execution of the computation graph is complete.
Upon completion of the execution of the computation graph, the example graph executor 320 provides the result of the execution (e.g., the output tensor) to the script engine 310. (Block 535).
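The hardware-aware lookup of block 522 might be sketched as follows. The tier names and routines are illustrative; real AVX or VNNI implementations would be native binary code rather than JavaScript, but the keying of the data store by operation and CPU feature is the point being shown.

```javascript
const cpuCodeStore = {
  // Per-feature implementations of the same operation; the avx entry
  // stands in for a vectorized routine, the generic entry for a
  // portable fallback.
  'add:avx': { tier: 'avx', run: (a, b) => a.map((v, i) => v + b[i]) },
  'add:generic': { tier: 'generic', run: (a, b) => a.map((v, i) => v + b[i]) },
};

function lookupOptimizedCode(op, cpuFeatures) {
  // Prefer the most specific implementation the hardware supports.
  for (const feature of cpuFeatures) {
    const hit = cpuCodeStore[`${op}:${feature}`];
    if (hit) return hit;
  }
  return cpuCodeStore[`${op}:generic`];
}
```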
In some cases, the code to be executed (e.g., the web application) may cause execution of a same computation graph many times. In such an example, it may be more efficient for such a computation graph to be executed by a GPU. While execution of instructions by a GPU incurs the additional overhead of communicating with the privileged instruction executor, as well as of compiling the computation graph for execution by the GPU 237, when the computation graph is executed frequently, such GPU-based execution can be more efficient overall. The example graph profiler 340 profiles the execution of the computation graph to determine whether execution is frequent. (Block 540).
In examples disclosed herein, a computation graph is considered to be executed frequently when it has been executed more than a threshold number of times within a previous threshold time period (e.g., more than twice in the last minute). However, any other factors may be used to determine whether the computation graph is executed frequently and/or, more generally, whether the computation graph should be compiled for future execution by the GPU 237. For example, such factors may include the size of the computation graph (which may have an impact on the amount of resources used to compile the computation graph), the origin of the computation graph (e.g., whether the computation graph originates from a frequently accessed network resource and/or website), and the types of operations included in the computation graph (which may indicate whether compilation is expected to be successful).
If the computation graph is executed frequently (block 545 returns a result of YES), the example graph executor 320 sends the computation graph to the example GPU compiler interface 350 for compilation into GPU instructions. (Block 555). An example approach to compiling the computation graph into GPU instructions is described in further detail in connection with
In the illustrated example of
Returning to block 515, if the example graph executor 320 determines that the mode of operation for the computation graph should be the compilation mode (e.g., the computation graph had previously been compiled into GPU instructions), the example GPU compiler interface 350 requests execution of the compiled GPU instructions. (Block 570). In examples disclosed herein, the GPU compiler interface 350 interfaces with the example privileged instruction executor 214 via the IPC client 360 to request execution of the compiled GPU instructions. The example GPU compiler interface 350 then accesses a result of execution of the compiled GPU instructions via the IPC client 360. (Block 575). The result of the execution is then provided to the graph executor 320 so that the result (e.g., the output tensor) can further be provided to the example script engine 310. (Block 580). Control then returns to block 540 where the example graph profiler 340 profiles execution of the computation graph to determine whether to compile the GPU instructions. In some examples, after profiling the computation graph, the example graph profiler 340 may determine that the computation graph is no longer frequently executed (e.g., block 545 may return a result of NO), in which case the graph profiler 340 may update the mode of operation for the graph to return the computation graph to being executed using the interpretation mode.
In contrast to prior approaches for execution of machine learning workloads at a GPU, where individual nodes of a computation graph were provided to the GPU for individual execution, in the illustrated example of
The example IPC server 410 interacts with the example request validator 430 to determine whether the computation graph is valid. (Block 630). In examples disclosed herein, the request validator 430 determines whether the computation graph and/or, more generally, the request to compile the computation graph received from the example unprivileged instruction executor 212 is valid based on additional parameters provided in the indication of the computation graph to be compiled. In some examples, the additional parameters may include, for example, a certificate parameter indicating that the request is valid. However, any other approach to validating a request from the unprivileged instruction executor 212 may additionally or alternatively be used.
If the example request validator 430 determines that the request is not valid (e.g., block 630 returns a result of not valid), the example IPC server 410 indicates an invalidity of the request to compile the computation graph to the unprivileged instruction executor. (Block 640). In such an example, the GPU compiler interface 350 of the example unprivileged instruction executor does not record that the compilation mode should be used upon subsequent requests to execute the computation graph. The example process 555 of the illustrated example of
Returning to block 630, if the example request validator 430 determines that the request for compilation of the computation graph is valid (e.g., block 630 returns a result of VALID), the example GPU compilation orchestrator 420 loads GPU source code corresponding to each of the nodes of the computation graph. (Block 650). That is, the example GPU compilation orchestrator 420 constructs GPU source code based on the operations that are to be performed as part of the computation graph. In examples disclosed herein, the source code is retrieved from the optimized GPU code data store 440. Moreover, the optimized GPU code data store 440 stores optimized GPU code that, for example, enables utilization of hardware specific instructions and/or extensions. For example, hardware-specific features, such as Intel Open Computing Language (OpenCL) extensions, may be utilized by the instructions stored in the optimized GPU code data store 440.
The example GPU compilation orchestrator 420 sends the source code for compilation into a GPU-specific kernel (e.g., binary code) to the GPU driver. (Block 660). The example GPU driver 227 then compiles the GPU source code into a GPU-specific kernel (e.g., binary code), and stores the GPU-specific kernel in the GPU instruction database 229. During the compilation, the example GPU compilation orchestrator 420 awaits completion of the compilation. (Block 670).
The example GPU compilation orchestrator 420 provides an indication of the completion of the compilation to the graph executor 320 via the IPC server 410. (Block 680). The graph executor 320 then marks the computation graph as compiled and switches to GPU compilation mode for subsequent executions of the computation graph (see block 560 of
The example GPU compilation orchestrator 420 requests execution of the kernel by the GPU 237. (Block 730). After the execution of the kernel completes, the result (e.g., the output tensor) is provided to the unprivileged instruction executor 212 (e.g., via the IPC communication channel 225). (Block 740). The example process 700 of
The processor platform 800 of the illustrated example includes a processor 812. The processor 812 of the illustrated example is hardware. For example, the processor 812 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In this example, the processor implements the example inter-process communication channel 225, the example GPU driver 227, the example script engine 310, the example graph executor 320, the example CPU interpreter 330, the example graph profiler 340, the example GPU compiler interface 350, the example IPC client 360, the example IPC server 410, the example GPU compilation orchestrator 420, and the example request validator 430.
The processor 812 of the illustrated example includes a local memory 813 (e.g., a cache). The processor 812 of the illustrated example is in communication with a main memory including a volatile memory 814 and a non-volatile memory 816 via a bus 818. The volatile memory 814 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®) and/or any other type of random access memory device. The non-volatile memory 816 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 814, 816 is controlled by a memory controller.
The processor platform 800 of the illustrated example also includes an interface circuit 820. The interface circuit 820 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.
In the illustrated example, one or more input devices 822 are connected to the interface circuit 820. The input device(s) 822 permit(s) a user to enter data and/or commands into the processor 812. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.
One or more output devices 824 are also connected to the interface circuit 820 of the illustrated example. The output devices 824 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube display (CRT), an in-plane switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer and/or speaker. The interface circuit 820 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.
The processor platform 800 of the illustrated example includes a graphics processing unit (GPU) 237 in communication via the bus 818.
The interface circuit 820 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 826. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.
The processor platform 800 of the illustrated example also includes one or more mass storage devices 828 for storing software and/or data. Examples of such mass storage devices 828 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.
The machine executable instructions 832 of
From the foregoing, it will be appreciated that example methods, apparatus and articles of manufacture have been disclosed that enable efficient execution of computation graphs using CPUs and GPUs. The disclosed methods, apparatus and articles of manufacture improve the efficiency of using a computing device by enabling interpreted execution of computation graphs on a CPU using optimized CPU instructions, as well as enabling a transition over to executing compiled GPU instructions that are compiled using GPU-specific source code. For example, by utilizing optimized CPU instructions (e.g., using AVX instructions), CPU interpreter execution is about 3.5× faster than existing WebAssembly execution. Moreover, by utilizing optimized GPU operation implementation (e.g., Intel OpenCL extensions), GPU compiler execution is about 4× faster than WebGL execution. Furthermore, while GPU execution starts slower than CPU execution (due to overhead associated with compiling GPU instructions and communicating such computation graph via IPC), using approaches disclosed herein enables a more controlled switch to utilization of a GPU compiler for better-sustained performance. The disclosed methods, apparatus and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.
Example 1 includes an apparatus for processing a machine learning model in a multi-process web browser environment, the apparatus including a graph executor to determine a mode of operation for a computation graph to be executed, a central processing unit (CPU) interpreter to lookup a CPU instruction corresponding to a node of the computation graph, the CPU instruction being a CPU-specific instruction for execution by at least one processor, a graph profiler to determine whether the computation graph is frequently executed, and a graphics processing unit (GPU) compiler interface to, in response to determining that the computation graph is frequently executed, transmit a request for compilation of at least two nodes of the computation graph into a GPU kernel for execution at a GPU.
Example 2 includes the apparatus of example 1, wherein the GPU compiler interface is to transmit a request for execution of the GPU kernel.
Example 3 includes the apparatus of example 1, wherein the GPU compiler interface is further to update the mode of operation for the computation graph, and the graph executor is to determine that the computation graph is to be executed using a compilation mode in response to the updating of the mode of operation for the computation graph.
Example 4 includes the apparatus of example 3, further including a request validator to, in response to the request for compilation of the computation graph, validate the request to compile the computation graph into the GPU kernel.
Example 5 includes the apparatus of example 4, further including a GPU compilation orchestrator to, in response to the request validator validating the request, identify GPU source code corresponding to the node of the computation graph, and compile the GPU source code into the GPU kernel.
Example 6 includes the apparatus of example 5, wherein the GPU source code is a GPU-specific instruction for execution by the GPU.
Example 7 includes the apparatus of example 6, wherein the GPU-specific instruction is an open compute language instruction.
Example 8 includes the apparatus of example 1, wherein the CPU-specific instruction is an advanced vector extension instruction.
Example 9 includes at least one non-transitory computer readable medium comprising instructions which, when executed, cause at least one processor to at least determine a mode of operation for a computation graph to be executed, in response to determining that the computation graph is to be executed using an interpretation mode, perform a lookup of a central processing unit (CPU) instruction corresponding to a node of the computation graph, the CPU instruction being a CPU-specific instruction for execution by the at least one processor, profile execution of the computation graph to determine whether the computation graph is frequently executed, and in response to determining that the computation graph is frequently executed, transmit a request for compilation of the computation graph into a graphics processing unit (GPU) kernel for execution at a GPU, and update the mode of operation for the computation graph.
Example 10 includes the at least one non-transitory computer readable medium of example 9, wherein the instructions, when executed, further cause the CPU instruction to be executed by the at least one processor.
Example 11 includes the at least one non-transitory computer readable medium of example 9, wherein the instructions, when executed, further cause the at least one processor to, in response to determining that the computation graph is to be executed using a compilation mode, transmit a request for execution of the GPU kernel.
Example 12 includes the at least one non-transitory computer readable medium of example 11, wherein the instructions, when executed, further cause the at least one processor to transmit the request for the execution of the GPU kernel to a privileged instruction executor.
Example 13 includes the at least one non-transitory computer readable medium of example 11, wherein the instructions, when executed, further cause the at least one processor to transmit the request for the execution of the GPU kernel via an inter-process communication channel.
Example 14 includes the at least one non-transitory computer readable medium of example 9, wherein the instructions, when executed, further cause the at least one processor to validate the request for compilation of the computation graph, in response to the validating of the request, identify GPU source code corresponding to the node of the computation graph, and compile the GPU source code into the GPU kernel.
Example 15 includes the at least one non-transitory computer readable medium of example 14, wherein the GPU source code is a GPU-specific instruction for execution by the GPU.
Example 16 includes the at least one non-transitory computer readable medium of example 15, wherein the GPU-specific instruction is an open compute language instruction.
Example 17 includes the at least one non-transitory computer readable medium of example 9, wherein the CPU-specific instruction is an advanced vector extension instruction.
Example 18 includes an apparatus for processing a machine learning model in a multi-process web browser environment, the apparatus including means for determining a mode of operation for a computation graph to be executed, means for identifying a central processing unit (CPU) instruction corresponding to a node of the computation graph, the CPU instruction being a CPU-specific instruction for execution by at least one processor, means for profiling to determine whether the computation graph is frequently executed, and means for transmitting, in response to determining that the computation graph is frequently executed, a request for compilation of the computation graph into a graphics processing unit (GPU) kernel for execution at a GPU.
Example 19 includes the apparatus of example 18, wherein the means for transmitting is to transmit a request for execution of the GPU kernel.
Example 20 includes the apparatus of example 18, wherein the means for transmitting is further to update the mode of operation for the computation graph, and the means for determining is to determine that the computation graph is to be executed using a compilation mode in response to the updating of the mode of operation for the computation graph.
Example 21 includes the apparatus of example 20, further including means for validating, in response to the request for compilation of the computation graph, the request to compile the computation graph into the GPU kernel.
Example 22 includes the apparatus of example 21, further including means for selecting, in response to the means for validating validating the request, GPU source code corresponding to the node of the computation graph, and compiling the GPU source code into the GPU kernel.
Example 23 includes a method of processing a machine learning model in a multi-process web browser environment, the method including determining, by executing an instruction with at least one processor, a mode of operation for a computation graph to be executed, and in response to determining that the computation graph is to be executed using an interpretation mode, performing a lookup of a central processing unit (CPU) instruction corresponding to a node of the computation graph, the CPU instruction being a CPU-specific instruction for execution by the at least one processor, profiling execution of the computation graph to determine whether the computation graph is frequently executed, and in response to determining that the computation graph is frequently executed, compiling the computation graph into a graphics processing unit (GPU) kernel for execution at a GPU, and updating the mode of operation for the computation graph.
Example 24 includes the method of example 23, further including causing the CPU instruction to be executed by the at least one processor.
Example 25 includes the method of example 23, further including, in response to determining that the computation graph is to be executed using a compilation mode, transmitting a request for execution of the GPU kernel.
Example 26 includes the method of example 25, wherein the determining that the computation graph is to be executed using the compilation mode is performed in response to the updating of the mode of operation for the computation graph.
Example 27 includes the method of example 25, wherein the request for the execution of the GPU kernel is transmitted to a privileged instruction executor.
Example 28 includes the method of example 25, wherein the request for the execution of the GPU kernel is transmitted via an inter-process communication channel.
Example 29 includes the method of example 23, wherein the compiling of the computation graph into the GPU kernel includes accessing a request to compile the computation graph into the GPU kernel, validating the request, in response to the validating of the request, identifying GPU source code corresponding to the node of the computation graph, and compiling the GPU source code into the GPU kernel.
Example 30 includes the method of example 29, wherein the GPU source code is a GPU-specific instruction for execution by the GPU.
Example 31 includes the method of example 30, wherein the GPU-specific instruction is an open compute language instruction.
Example 32 includes the method of example 23, wherein the CPU-specific instruction is an advanced vector extension instruction.
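The privileged-process side of the flow recited in the validation and compilation examples above (receive a compilation request, validate it, identify GPU source code per node, and compile the source into a kernel) can be sketched as follows. The lookup table `GPU_SOURCE`, the validation rule, and the string-fusing "compiler" are all illustrative assumptions standing in for a real GPU source-code repository and compiler invocation.

```python
# Stand-in GPU source code lookup: node operation -> GPU-specific source
# fragment (loosely OpenCL-flavored; illustrative only).
GPU_SOURCE = {
    "add": "out[i] = out[i] + b[i];",
    "mul": "out[i] = out[i] * b[i];",
}

def validate_request(request):
    # Assumed rule: a request from the unprivileged process is valid only
    # if GPU source code exists for every node in the computation graph.
    return all(op in GPU_SOURCE for op in request["nodes"])

def handle_compile_request(request):
    # Reject requests that fail validation (e.g., an unknown operation).
    if not validate_request(request):
        return None
    # Identify GPU source code corresponding to each node, then fuse the
    # per-node bodies into a single kernel string as a stand-in for a
    # real compilation step producing an executable GPU kernel.
    body = "\n    ".join(GPU_SOURCE[op] for op in request["nodes"])
    return "__kernel void fused_graph(...) {\n    " + body + "\n}"
```

Keeping validation and compilation in the privileged process mirrors the sandboxing architecture described earlier: the unprivileged renderer only transmits a request over IPC and never compiles or executes GPU code itself.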
Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2018/123216 | 12/24/2018 | WO | 00 |