The present disclosure generally relates to using machine learning models for processing data objects such as documents, images, or multimedia presentations. More specifically, but not by way of limitation, the present disclosure relates to programmatic techniques for characterizing and partitioning such a machine learning model for more efficient processing.
Common computing devices, including smartphones, tablets, and notebook computers, are becoming increasingly capable, and machine learning models are therefore increasingly designed for such devices. For example, the processing of a document by such a device for rendering may be streamlined using a neural network-based machine learning model when the pages are rendered in order. The device renders early pages of the document quickly, and then continues to process the subsequent pages for efficient rendering using machine learning as those pages are being rendered. Machine learning models can also be used to adaptively predict user input to enhance actual or perceived performance of games or other applications.
Certain aspects and features of the present disclosure relate to a computer-implemented method. The method includes accessing a machine learning model configured for processing a data object and partitioning the machine learning model into a number of partitions. The method further includes characterizing each of the partitions of the machine learning model with respect to runtime requirements. The method also includes executing each of the partitions of the machine learning model using a runtime environment corresponding to runtime requirements of the respective partition to process the data object, and rendering output based on the processing of the data object.
Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this disclosure, any or all drawings, and each claim.
Features, embodiments, and advantages of the present disclosure are better understood when the following Detailed Description is read with reference to the accompanying drawings, where:
As described above, machine learning models can be used on smartphones, tablets, and other computing devices to improve the user experience. However, existing methods of executing machine learning models on end-user computing devices can strain the computing resources of such devices. In particular, techniques that process portions of an object, such as a stored document and/or image, using a machine learning model while other portions of the object are being rendered or otherwise output can result in resource competition between operating system or application components, which can in turn result in undesirable effects such as staggered or delayed scrolling and visual artifacts. For example, when rendering documents on a low-end device, jank may be observed. Jank is a term for stuttering in scrolling or other user interface activity caused by skipped frames, and it can result from a machine learning model's extended use of GPU resources.
Machine learning models can be selected and executed to minimize such undesirable effects. For example, a neural architecture search (NAS) can be executed by a software development platform or otherwise employed to find the best neural network for a given task and device, one that balances accuracy, processing time, memory usage, etc. Once the appropriate neural network is selected using NAS, the application that runs the network can be designed to use the most appropriate computing resources for that network architecture. NAS design approaches find neural networks that are suited for a particular task, but they treat the network as a unitary framework to be used for the task even though different layers of the network may require different resources for their most efficient execution. Thus, finding the most efficient model for a given application reduces, but often does not completely eliminate, these issues.
Selecting the best machine learning model to minimize undesirable effects can be burdensome for developers. Manual searches are time consuming and must be repeated as new versions of the software are developed, since the array of available machine learning models is continuously changing. NAS approaches to development can also be expensive, as they can require many high-end GPUs to perform an effective search.
Embodiments described herein address these issues by programmatically dividing or partitioning a machine learning model to run different parts of the model with different runtime profiles, balancing trade-offs between execution speed and undesirable effects in output. When, or before, a computing device accesses an object and a machine learning model configured for processing the object, the model is divided into multiple partitions. For example, if the machine learning model is a neural network, each partition includes a subset of layers of the entire network. A computing device characterizes each of the partitions of the machine learning model with respect to runtime requirements and executes each of the partitions using a runtime environment corresponding to the respective runtime requirements of each partition in order to process the object. Output can then be rendered based on this processing, which is more efficient than processing the object while treating the model as a unitary framework to be processed entirely with one set of runtime requirements. Since this partitioning greatly improves the performance of any model, developers can choose from a variety of models without conducting an exhaustive search to find the one model that would provide the best overall performance. Developers thus have more flexibility to choose from various machine learning models to accomplish a task, streamlining the development process for applications or new versions of applications.
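As a minimal illustration of the partitioning step, the following sketch (hypothetical, and assuming a sequential PyTorch model with arbitrary split points, neither of which is prescribed by this disclosure) divides a neural network into partitions that are each a contiguous subset of the network's layers; running the partitions in sequence reproduces the output of the unpartitioned model:

```python
# Hypothetical sketch: split a sequential neural network into partitions,
# where each partition is a contiguous subset of the original layers.
import torch
import torch.nn as nn

def partition_model(model: nn.Sequential, split_points: list[int]) -> list[nn.Sequential]:
    """Return sub-models covering layers [0:s1), [s1:s2), ..., [sn:end)."""
    bounds = [0] + split_points + [len(model)]
    layers = list(model.children())
    return [nn.Sequential(*layers[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]

if __name__ == "__main__":
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(32 * 8 * 8, 10),
    )
    partitions = partition_model(model, split_points=[2, 4])

    # Running the partitions in sequence reproduces the unpartitioned model.
    x = torch.randn(1, 3, 8, 8)
    y = x
    for p in partitions:
        y = p(y)
    assert torch.allclose(y, model(x))
```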
For example, a document viewing application provides a computing device with the capability of displaying and scrolling through a document with various types of information such as graphics, text, tables, photos, or even embedded videos. An on-device neural network used for processing such a document uses different layers to process the various types of information in the document. Some layers may execute more efficiently in one runtime environment while other layers may execute more efficiently in another. For example, some layers may have more GPU-efficient runtime requirements and others may have more CPU-efficient runtime requirements. As another example, some layers may run more efficiently in a runtime environment that relies on low-memory, complex operators; others may run more efficiently in an environment that makes use of high-memory, optimized resources; and still others may run most efficiently using specialized post-processing. Rather than running the entire model in a single environment, and in order to take advantage of these varying requirements for layers of the neural network, the model is partitioned to run different parts of the model with different runtime profiles. A computing device characterizes each of the partitions of the model with respect to runtime requirements and executes each of the partitions in real time using a runtime environment corresponding to the respective runtime requirements. Scrolling and other movement between different parts of the displayed document is smooth, with little or no dropped frames or stuttering. Partitioning of the model can be accomplished on the end-user device, either statically as part of the installation or startup of an application, or dynamically depending on settings or other conditions. Alternatively, partitioning of the model can be accomplished during application development and/or deployment, with the partitioned model being distributed for installation along with or as part of the application.
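One way the characterization step could be realized, sketched here only as a hypothetical example assuming PyTorch and a simple wall-clock benchmark, is to time each partition under each candidate runtime environment and record the environment in which it runs fastest:

```python
# Hypothetical sketch: characterize each partition by timing it in each
# candidate runtime environment and recording the fastest one.
import time
import torch

@torch.no_grad()
def characterize(partitions, sample_input, devices=("cpu", "cuda"), iters=20):
    """Return, for each partition, the device on which it ran fastest."""
    requirements = []
    x = sample_input
    for idx, part in enumerate(partitions):
        timings = {}
        for dev in devices:
            if dev == "cuda" and not torch.cuda.is_available():
                continue
            part.to(dev)
            inp = x.to(dev)
            part(inp)                                   # warm-up pass
            if dev == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iters):
                part(inp)
            if dev == "cuda":
                torch.cuda.synchronize()
            timings[dev] = (time.perf_counter() - start) / iters
        requirements.append({"partition": idx,
                             "device": min(timings, key=timings.get),
                             "timings": timings})
        x = part.to("cpu")(x)                           # input for the next partition
    return requirements
```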
In some examples, partitioning of the machine learning model can also be used to provide more targeted information security, which can further improve performance. For example, instead of encrypting the entire model, the system can make use of a model with encryption applied only to partitions that include or process sensitive information. In some examples, some partitions can be stored off-device, on a server or in the cloud, to further improve resource utilization or reduce on-device storage requirements. Partitions can also be reused for multiple applications. Developers can select any of various models and achieve a high level of performance without necessarily selecting one model that is a compromise for some functions while being the best for other functions.
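A minimal sketch of such selective protection, assuming the `cryptography` package's Fernet scheme and hypothetical partition file names (neither of which is specified by this disclosure), might encrypt only the partition files designated as sensitive:

```python
# Hypothetical sketch: encrypt only the partition files that contain or process
# sensitive information, leaving the remaining partitions in plain form.
from pathlib import Path
from cryptography.fernet import Fernet  # pip install cryptography

def encrypt_sensitive_partitions(partition_dir: str, sensitive: set[str], key: bytes) -> None:
    f = Fernet(key)
    for path in Path(partition_dir).glob("*.bin"):
        if path.stem in sensitive:                      # e.g. {"partition_2"}
            path.with_suffix(".enc").write_bytes(f.encrypt(path.read_bytes()))
            path.unlink()                               # remove the plaintext copy

if __name__ == "__main__":
    key = Fernet.generate_key()                         # stored/delivered securely in practice
    encrypt_sensitive_partitions("model_partitions", {"partition_2"}, key)
```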
The use of partitioned machine learning models for processing objects can provide seamless object processing without slowdowns or interruptions, especially for real-time applications that are increasingly used on mobile computing devices. Splitting the model into smaller partitions provides for more efficient memory use. Certain blocks of operators use less memory when run on a CPU, a GPU, or other delegates of the operating system, and optimizing the use of these resources translates into better performance and stability than picking a single delegate to run the entire model. The model-partitioning approach disclosed herein can be used for many types of processing of various types of data objects. Operations for handling presentation media such as document display, image display, audio processing, automated translation, and automated captioning are but a few examples.
The application 102 also generates an output interface 130. Output interface 130 is used by application 102 to produce rendered media 132 from stored media object 122. For example, stored media object 122 may be a document and rendered media 132 may include various pages with graphics, photos, text, etc. In some embodiments, the application 102 uses movement and selection inputs 136, which can include touch, pointing device signals, and the like, for scrolling, movement, or selection and overall control of rendered media 132. These interactions with rendered media 132 are made smoother and more efficient by applying the most efficient runtime environment, in accordance with runtime requirement definitions 120, to each of the partitions 111 of the machine learning model 110. While some examples are presented as that of a document viewing application rendering a document, an application using runtime-specific partitioning can be used for any type of data object, or for processing other than rendering. Examples include editing, viewing, or analyzing documents, images, video, or audio; text or speech translation; graphic production; media streaming; and automated captioning.
In another example, partitioning may be accomplished when the application is compiled or distributed, in which case application 102 is a software development platform or software distribution application. The application with the partitioned model is forwarded to and resides on end-user computing device 140 (a tablet computer) and may be deployed to device 140 through cloud computing system 106. In this case, on-device partitions 141 are used in processing media object 142. An output interface may not be fully implemented by application 102, but instead may be provided by computing device 140, with movement and selection inputs provided through a touchscreen.
Partition 202 in
Still referring to
At block 308 of process 300, the computing device executes each of the partitions of the machine learning model using a runtime environment that corresponds to the respective runtime requirements in order to process the data object. To continue with the example of computing device 200, partition 206 is run with fast, 8-bit, quantized processing and partition 208 is restricted to the CPU as opposed to the GPU of computing device 200. At block 310, the computing device renders output based on the processing of the data object. This output may be, as examples, a portion of a document, where the document is the data object being processed, or a portion of a video if a video file is the data object being processed.
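A hypothetical sketch of this per-partition execution, assuming PyTorch and its dynamic int8 quantization (the disclosure does not prescribe a particular framework), might prepare and run each partition according to its assigned profile while chaining intermediate results from partition to partition:

```python
# Hypothetical sketch: run each partition in the runtime environment assigned
# to it -- for example, one partition dynamically quantized to 8-bit integers
# on the CPU and another left in floating point on the GPU.
import torch
import torch.nn as nn

def prepare_partition(part: nn.Module, profile: dict):
    device = profile.get("device", "cpu")
    if device == "cuda" and not torch.cuda.is_available():
        device = "cpu"                                  # fall back when no GPU is present
    if profile.get("quantize_int8", False):
        # Dynamic int8 quantization of linear layers; runs on the CPU in stock PyTorch.
        part = torch.quantization.quantize_dynamic(part, {nn.Linear}, dtype=torch.qint8)
        device = "cpu"
    return part.to(device), device

@torch.no_grad()
def run_partitions(partitions, profiles, x):
    for part, profile in zip(partitions, profiles):
        prepared, device = prepare_partition(part, profile)
        x = prepared(x.to(device))
    return x.to("cpu")
```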
Continuing with
All of the above-mentioned layers can be included in on-device neural network model 416, shown on the left side of
Partition 436 is for layers 413 and 414, which are custom, non-maximum suppression layers that run outside the typical machine learning runtime environment. Layers 413 and 414 are used for post-processing. An application using a model as described above has been tested both with partitioning and without partitioning. The partitioning resulted in a 30% savings in memory consumption, primarily due to partitions other than the backbone partition 430 consuming far less memory when run using an XNN neural network package or using the CPU. By splitting up the original model 416, greater control is provided to determine which partitions can be sped up in different ways to achieve the lowest inference time. For instance, certain blocks of operators run fastest on the CPU, whereas other, highly parallelizable partitions run faster on the GPU. Further, partitions can be selectively quantized to 16-bit or 8-bit processing to minimize the loss of accuracy while improving the time taken to obtain the model's result.
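As an illustration of a post-processing partition that runs outside the machine learning runtime, the following is a simplified, hypothetical non-maximum suppression routine written in plain NumPy:

```python
# Hypothetical sketch: a post-processing partition (non-maximum suppression)
# implemented outside the machine learning runtime, in plain NumPy.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list[int]:
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = scores.argsort()[::-1]                      # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the highest-scoring box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]              # suppress overlapping boxes
    return keep
```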
As previously mentioned, with document viewing applications, on-device deep learning models may compete for GPU resources with other models, as well as with UI rendering processing, causing jank that can be especially noticeable when input such as pinching, scrolling, and zooming is received. Partitioning so that the model does not hold onto the GPU for the entire duration of model inference can significantly reduce jank. Further, by processing partitions that can run more efficiently on the CPU using the CPU, or using an XNN package delegate, instead of the GPU, the processor cycles needed for the UI thread to render smoothly remain available, and UI rendering can run in parallel with model inference.
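A minimal sketch of this kind of parallelism, using Python's standard thread pool and hypothetical application callbacks, submits a CPU-delegated partition to a worker thread so that frame rendering can continue on the main thread:

```python
# Hypothetical sketch: run CPU-bound partition inference on a worker thread so
# the UI/GPU thread keeps rendering frames while the partition executes.
from concurrent.futures import ThreadPoolExecutor

def run_cpu_partition_async(executor: ThreadPoolExecutor, partition, tensor):
    """Submit a CPU-delegated partition for execution off the UI thread."""
    return executor.submit(partition, tensor)           # returns a Future

# Usage (render_frame and next_input are hypothetical application callbacks):
# with ThreadPoolExecutor(max_workers=1) as executor:
#     future = run_cpu_partition_async(executor, cpu_partition, next_input())
#     while not future.done():
#         render_frame()                                # UI continues to render smoothly
#     result = future.result()
```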
As an example, a document viewing application has been run both with and without partitioning of the model used by the application. Without partitioning, the application used the GPU for processing. The partitioning resulted in the offloading of some partitions to an XNN package delegate or the CPU while running only the remaining partitions on the GPU. The latency of the offloaded partitions was reduced, in at least one instance by almost 700 ms, which noticeably reduced jank.
The use of shared partitions is not limited to backbones. For example, partition 504 and another partition both include portions of additional shared partition 510. Such a system of partitions provides flexibility and control over the orchestration of running multiple models in parallel to stay within performance targets.
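A hypothetical sketch of partition sharing, assuming PyTorch and toy backbone and head modules, evaluates the shared partition once and feeds its output to both model-specific heads:

```python
# Hypothetical sketch: two models share a backbone partition, so the shared
# partition is evaluated once and its output feeds both model-specific heads.
import torch
import torch.nn as nn

shared_backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
detection_head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 8 * 8, 4))
classification_head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 8 * 8, 10))

@torch.no_grad()
def run_both(x):
    features = shared_backbone(x)                       # shared partition runs once
    return detection_head(features), classification_head(features)

boxes, labels = run_both(torch.randn(1, 3, 8, 8))
```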
At block 602 of process 600, the computing device 101 accesses stored machine learning model 110 configured for processing a data object. At block 604, computing device 101 partitions the machine learning model into multiple partitions 111 and/or 112. At block 606, computing device 101 characterizes each of the partitions of the machine learning model with respect to runtime requirement definitions 120 for running each particular partition of the model most efficiently on computing device 140. For example, a partition may run most efficiently with resources designed for a low-memory, complex operator, or for a high-memory, optimized operator. A partition may run most efficiently using specialized post-processing. As other examples, some partitions may be more GPU-efficient while other partitions may be more CPU-efficient, and still other partitions can run efficiently on either the GPU or the CPU. In some examples, the partitioning of the model provides for a more compact memory footprint, in part by bypassing and improving on the default configurations imposed by the end-user computing device or its operating system frameworks. The functions included in block 602, block 604, block 606, and block 608, all discussed with respect to
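The runtime requirement definitions 120 could, for example, be represented as a simple declarative structure that an executor consults for each partition; the field names below are hypothetical:

```python
# Hypothetical sketch: runtime requirement definitions expressed as a simple
# declarative structure consulted when each partition is executed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RuntimeRequirement:
    partition_id: int
    delegate: str                       # e.g. "gpu", "cpu", or "xnn"
    quantization: Optional[str] = None  # e.g. "int8" or "float16"
    max_memory_mb: Optional[int] = None

requirement_definitions = [
    RuntimeRequirement(0, delegate="gpu"),
    RuntimeRequirement(1, delegate="cpu", quantization="int8", max_memory_mb=64),
    RuntimeRequirement(2, delegate="cpu"),   # e.g. a custom post-processing partition
]
```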
Continuing with
While the examples above make use of deep learning neural networks, other types of machine learning models can be partitioned as described herein. For example, a random forest model can be used. Such models are typically very wide rather than very deep. It can be beneficial to break down such a model based on input features, allowing control over the trade-off between the memory needed by a very wide model and the potential for significant parallelism. The same security considerations previously mentioned with respect to deep learning neural networks also apply to random forest models, i.e., only part of the model needs to be encrypted.
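A hypothetical sketch of partitioning such a wide model, assuming scikit-learn, splits a fitted random forest into groups of trees that can be evaluated independently and then combined; the disclosure suggests partitioning by input features, and grouping trees is shown here only as a simpler illustration:

```python
# Hypothetical sketch: partition a wide ensemble model (a random forest) into
# groups of trees that can be evaluated independently -- e.g. in parallel or on
# different resources -- and then combined.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=40, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Split the fitted trees into four equal partitions.
partitions = np.array_split(forest.estimators_, 4)

# Each partition's probabilities are computed separately and averaged,
# reproducing the full forest's soft prediction.
probs = np.mean([np.mean([t.predict_proba(X) for t in part], axis=0)
                 for part in partitions], axis=0)
assert np.allclose(probs, forest.predict_proba(X))
```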
A software development or distribution platform that partitions a model and includes the partitions in software destined for various end-user devices can create customized partitioning based on each device or type of device. For example, low-end mobile devices often have limited memory or computational power compared to mid- or high-end devices. With unitary machine learning models, owing to strict resource constraints, such devices are often not eligible for deployment of those models because of performance issues when the models are run in the usual configuration. Partitioning as described herein makes it possible to have fine-grained control over how the machine learning model is divided, prioritizing lower resource use, energy efficiency, and other characteristics so that deployed software can meet performance goals on low-end devices. A single machine learning model can be split into different partitions in different ways on a per-device or per-tier basis to achieve the best results.
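Per-device or per-tier partitioning could, for example, be driven by a deployment-time configuration mapping device tiers to partition plans; the tiers and field names below are hypothetical:

```python
# Hypothetical sketch: a deployment-time configuration that maps device tiers
# to different partition plans for the same underlying model.
PARTITION_PLANS = {
    "low_end": {                       # prioritize low memory and energy use
        "split_points": [2, 4, 6],
        "delegates": ["cpu", "cpu", "cpu", "cpu"],
        "quantization": "int8",
    },
    "mid_tier": {
        "split_points": [4],
        "delegates": ["gpu", "cpu"],
        "quantization": "float16",
    },
    "high_end": {                      # a single GPU partition may suffice
        "split_points": [],
        "delegates": ["gpu"],
        "quantization": None,
    },
}

def plan_for(device_tier: str) -> dict:
    return PARTITION_PLANS.get(device_tier, PARTITION_PLANS["low_end"])
```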
Still referring to
The system 700 of
Staying with
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “selecting,” “creating,” and “determining,” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multi-purpose microprocessor-based computer systems accessing stored software that programs or configures the computing system from a general purpose computing apparatus to a specialized computing apparatus implementing one or more implementations of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied—for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The use of “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.