This application claims priority to India Provisional Application No. 202041025785, filed Jun. 18, 2020, which is hereby incorporated by reference.
Machine learning (ML) is becoming an increasingly important part of the computing landscape. Machine learning is a type of artificial intelligence (AI) that helps enable a software system to learn to recognize patterns from data without being directly programmed to do so. Neural networks (NNs) are a type of ML which utilize a set of linked and layered functions (e.g., nodes, neurons, etc.) which are weighted to evaluate input data. In some NNs, sometimes referred to as convolutional neural networks (CNNs), convolution operations may be performed in NN layers based on inputs received and weights. A convolution operation is a mathematical transformation applied to two functions to produce a third function which expresses how the shape of one function is modified by the other. Examples of CNNs include deconvolutional neural networks, pooling neural networks, up-sample neural networks, deep neural networks, etc. CNNs are often used in a wide array of applications, typically for recognition and classification, such as image recognition and classification, prediction and recommendation systems, speech and language recognition and translation, etc.
As ML becomes increasingly useful, there is a desire to execute complex ML techniques, such as NNs and CNNs, efficiently in devices with relatively limited compute and memory resources, such as embedded or other low-power devices. To help efficiently run a given ML model on target hardware resources, the ML model may be analyzed and optimized using super tiling to tailor the ML model for the target hardware resources to be used.
This disclosure relates to a technique for enhancing ML model execution. The technique includes determining an amount of memory used to process layers of a machine learning network having multiple layers, smoothing the amount of memory used to process the layers of the machine learning network based on a number of layers, identifying change layers where the smoothed amount of memory used changes more than a memory change threshold amount, grouping the layers of the machine learning network into a first layer grouping based on the identified change layers, and outputting the first layer grouping.
Another aspect of the present disclosure relates to a non-transitory program storage device comprising instructions stored thereon to cause one or more processors to: determine an amount of memory used to process layers of a machine learning network having multiple layers, smooth the amount of memory used to process the layers of the machine learning network based on a number of layers, identify change layers where the smoothed amount of memory used changes more than a memory change threshold amount, group the layers of the machine learning network into a first layer grouping based on the identified change layers, and output the first layer grouping.
Another aspect of the present disclosure relates to a device, comprising: a memory, and one or more processors operatively coupled to the memory, wherein the one or more processors are configured to execute non-transitory instructions causing the one or more processors to: determine an amount of memory used to process layers of a machine learning network having multiple layers, smooth the amount of memory used to process the layers of the machine learning network based on a number of layers, identify change layers where the smoothed amount of memory used changes more than a memory change threshold amount, group the layers of the machine learning network into a first layer grouping based on the identified change layers, and output the first layer grouping.
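As a non-limiting illustration of the layer-grouping technique summarized above, the following is a minimal Python sketch. It assumes per-layer memory use has already been determined (e.g., by modeling the CNN) and uses a simple moving average for smoothing; the function names, the window parameter, and the choice of smoothing are illustrative assumptions rather than a definitive implementation.

```python
# Illustrative sketch of the layer-grouping technique described above.
# Assumes per-layer memory use is already known (e.g., from modeling the CNN);
# names and the moving-average smoothing choice are hypothetical.

def smooth(volumes, window=3):
    """Smooth per-layer memory volumes with a simple moving average."""
    half = window // 2
    out = []
    for i in range(len(volumes)):
        lo, hi = max(0, i - half), min(len(volumes), i + half + 1)
        out.append(sum(volumes[lo:hi]) / (hi - lo))
    return out

def group_layers(volumes, change_threshold, window=3):
    """Group layers at points where the smoothed memory use changes by more
    than change_threshold (the memory change threshold amount)."""
    smoothed = smooth(volumes, window)
    boundaries = [0]
    for i in range(1, len(smoothed)):
        if abs(smoothed[i] - smoothed[i - 1]) > change_threshold:
            boundaries.append(i)  # a "change layer": a new group starts here
    boundaries.append(len(volumes))
    # Each group is a half-open [start, end) range of layer indices.
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]
```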
For a detailed description of various examples, reference will now be made to the accompanying drawings in which:
In certain cases, a tensor may be split into tiles for processing, as shown in tensor 200 of FIG. 2.
Generally, it is advantageous to store as much of the information required to execute a CNN as possible in memory close to the processor, to help performance. Generally, memory close to a processor may be referred to as on-chip memory, while memory that is relatively further from the processor may be referred to as system memory, main memory, or random-access memory (RAM), and even further memory may be referred to as storage, disk, or hard disk. Examples of on-chip memory include static random-access memory (SRAM) and cache memory. Cache memory may further be divided into levels, such as level 1 (L1), level 2 (L2), and level 3 (L3), with higher numbers generally indicating that the cache is further away (e.g., slower to access) from the processor. As an example of processing an intermediate input tensor in a corresponding layer, the input tensor may be stored in a level 3 (L3) memory cache, while weights, the CNN model, and input and output information are stored in a level 2 (L2) cache. As portions of the tensor are processed, output may be stored temporarily in the L2 cache and then output to another intermediate tensor, for example, in the L3 cache as the input tensor is processed. Outputting the next tensor into the L3 cache helps prepare the system to process the next layer. In certain cases, the initial input tensor and final output may be stored in system memory. Storing and accessing intermediate tensors entirely in cache helps reduce the need to access external memory, such as system memory like double data rate (DDR) memory, which can take a number of clock cycles (e.g., processing cycles) and reduce processing efficiency, as the processor may need to stall while waiting for data.
While the size of a memory may be fixed, the size required by an intermediate tensor can vary. For example, a CNN may have a half megabyte (MB) sized input tensor and may be associated with two intermediate tensors of 5 MB and 12 MB, respectively. If, for example, a near-processor memory such as an L3 cache is only 8 MB, the 12 MB intermediate tensor will not be able to fit entirely within the L3 cache, and a portion of the 12 MB intermediate tensor will likely be stored in system memory. As accesses to system memory take substantially longer than accesses to cache memory, in this case, processing times for the 12 MB intermediate tensor would be bottlenecked by memory input/output times.
In certain cases, a portion of an input tensor is overwritten by a corresponding output of processing that portion of the input tensor.
Each of the layers discussed in this example is a 3×3 convolution layer. In a 3×3 convolution layer, each tile is processed along with one neighboring tile in each dimension for the layer. Each tensor includes two zero pads, represented by the −1 and 20 entries. These zero pads may be used as neighboring tiles when processing tiles on the edge of a given tensor. Here, at the end of each super tile pass, the fourth tensor 408 has five completed tiles 410. As each layer is a 3×3 convolution layer, tile 5 of the third tensor 406A is used to generate tile 4 of the fourth tensor 408A. Likewise, tile 6 of the second tensor 404A is used to generate tile 5 of the third tensor 406A, and so forth. After the first super tile pass is completed, the second super tile pass is performed. As with the first super tile pass, five completed tiles 412 are generated after the second super tile pass is completed. As discussed in conjunction with
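The halo relationship in this example can be sketched briefly. The following Python snippet is a hypothetical illustration only: it assumes each 3×3 convolution layer requires a one-tile halo of its input per output tile, consistent with the tile numbering above.

```python
# Hypothetical sketch of the tile dependency described above: producing tile t
# of a tensor through a 3x3 convolution layer consumes tiles t-1..t+1 of the
# previous tensor (a one-tile halo per layer).

def tiles_needed(tile, layers_back):
    """Inclusive range of tiles needed `layers_back` 3x3 layers upstream to
    produce `tile` of the downstream tensor."""
    return (tile - layers_back, tile + layers_back)

# Tile 4 of the fourth tensor needs tiles 3..5 of the third tensor (including
# tile 5, as in the example) and tiles 2..6 of the second tensor.
print(tiles_needed(4, 1))  # (3, 5)
print(tiles_needed(4, 2))  # (2, 6)
```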
Each super tile group may be associated with certain super tile group properties. These super tile group properties may include properties such as a number of layers in the super tile group, tile heights associated with the layers, and a context memory. In this example, the number of layers in a first super tile group 502A includes four layers 504, here layers 1, 2, 3, and 4. A second super tile group 502B, in this example, also includes four layers 518, here layers 5, 6, 7, and 8. It may be understood that each super tile group may have a different number of layers. Each layer may be associated with one or more tile heights. In some cases, each layer may be associated with a first tile height, a normal tile height, and a last tile height. The first tile height may indicate a number of tiles for each layer during the first run. In some cases, the first run may be a virtual or prewarming super tile pass, here labeled as pass 0 506. The virtual super tile pass may not produce a completed tile in the last tensor of the layer group. Rather, the virtual super tile pass computes a set of tiles which overlap with tiles of the next, normal super tile pass and stores (e.g., backs up) these computed tiles for the next pass. In this example, the first tile height for the first layer is 3, for the second layer is 2, for the third layer is 1, and for the fourth layer is 0.
The normal tile height may indicate a number of tiles for each layer during a steady state run of the super tile passes, here labeled as pass 1 508, pass 2 510, and pass 3 512. In this example, the normal tile height for all of the layers is 5. It may be understood that the normal tile height for each layer may be different. The last tile height indicates a number of tiles for each layer for the last pass, here pass 4 514, of the super tile run. In this example, the last tile height for the first layer is 2, for the second layer is 3, for the third layer is 4, and for the fourth layer is 5.
The context memory super tile group property refers to the stored or backed up tiles 516 for the passes. In this example, the context memory size is six tiles.
Super tile groups and associated super tile group properties may be defined for a CNN to help tailor the execution of the CNN for certain hardware resources. Each CNN may have a unique combination of a number of layers, tensor dimensions for each layer, and what each layer may be doing. For example, certain layers, such as layers performing a pooling function, convolution function, etc., may be associated with a down-sampling property where the layer takes an input tensor of a certain dimension and outputs a tensor with reduced dimensions. Other layers, such as layers performing a resizing function, deconvolution function, etc., may be associated with an up-sampling property where the layer takes an input tensor of a certain dimension and outputs a tensor with increased dimensions.
To help tailor the execution of the CNN for a given hardware resource, the CNN may be modeled to determine a total volume of memory (e.g., an amount of memory) needed for each layer of the CNN. This total volume of memory may include all memory needed to execute the layer of the CNN, including memory needed for the input tensor(s), output tensor(s), backed up tiles, operational parameters needed for the layer, etc. Super tile groups may be defined based on this total volume of memory.
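By way of a hedged illustration, the total volume for one layer might be tallied as follows; the component names are assumptions drawn from the list above, not a definitive accounting.

```python
# Illustrative sketch of totaling the memory volume for one CNN layer. The
# components are assumptions based on the description above: input and output
# tensors, backed-up tiles, and operational parameters (e.g., weights).
MB = 1 << 20  # bytes per megabyte

def layer_total_volume(input_bytes, output_bytes, backup_bytes, param_bytes):
    """Total memory volume needed to execute one CNN layer."""
    return input_bytes + output_bytes + backup_bytes + param_bytes

# e.g., a layer reading a 5 MB tensor, writing a 12 MB tensor, keeping 0.5 MB
# of backed-up tiles, and using 1 MB of weights needs 18.5 MB in total.
total = layer_total_volume(5 * MB, 12 * MB, MB // 2, 1 * MB)
```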
Based on the windowed total volume data, points where the total volume changes by a certain amount, which may be referred to as a volume change factor, may be identified. These identified points may be used to determine initial boundaries for the super tiling groups. In the example line graph 650, points may be identified between layers 5 and 6, layers 12 and 13, layers 24 and 25, and layers 49 and 50. While in this example there is a total volume change between layers 33 and 34 and layers 54 and 55, the total volume change at these points may be below the volume change factor and thus these points are not identified. Thus, five super tiling groups may be defined as including layers [1:5], [6:12], [13:24], [25:49], and [50:64]. If a relatively smaller volume change factor had been used, additional super tiling groups may be defined, such as [1:5], [6:12], [13:24], [25:49], [50:54], [55:64] or [1:5], [6:12], [13:24], [25:33], [34:49], [50:54], [55:64]. In certain cases, the volume change factor may be predetermined, for example, as a default value, received from a user, etc. In other cases, the volume change factor may be determined based on one or more factors, for example, based on a cache or memory size, a maximum total volume across all layers, a ratio of maximum total volume to minimum total volume, etc. The volume change factor may be chosen to balance noise reduction and the number of points identified. In some cases, multiple volume change factors may be used to determine multiple sets of super tiling groups for comparison, for example, via performance simulations (e.g., modeling).
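Continuing the earlier group_layers sketch, the following hypothetical input reproduces this example; the volume numbers are invented solely so that the steps after layers 5, 12, 24, and 49 exceed a change threshold of 3 MB while the smaller steps after layers 33 and 54 do not.

```python
# Invented per-layer volumes (in MB) shaped to mirror the example above, used
# with the group_layers sketch shown earlier (window=1, i.e., no smoothing).
volumes = ([40] * 5 + [30] * 7 + [22] * 12 +   # layers 1-5, 6-12, 13-24
           [16] * 9 + [14] * 16 +              # layers 25-33, 34-49
           [8] * 5 + [7] * 10)                 # layers 50-54, 55-64
groups = group_layers(volumes, change_threshold=3, window=1)
# -> [(0, 5), (5, 12), (12, 24), (24, 49), (49, 64)], i.e., layers [1:5],
# [6:12], [13:24], [25:49], and [50:64] in the 1-based numbering above.
```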
After the super tiling groups are identified, the super tiling groups may be refined. In some cases, super tiling groups may be refined based on a cost minimization performed across super tiling group variants. For example, an initial super tiling group variant may be the super tiling groups as identified based on the total volume changes. A cost factor may be determined and associated with this initial super tiling group variant. This cost factor may be determined based on performance simulations (e.g., modeling) of the CNN being executed using the initial super tiling group variant. The performance simulations may account for memory access latencies, processing speed, and power consumption for a target hardware resource (e.g., the hardware resource CNN execution is being optimized for). The cost factor is then associated with the initial super tiling group variant. A variant of the super tiling group is then determined by moving one or more group boundaries of the super tiling group within a refinement range N of the initial group boundary. In some cases, the refinement range may be both positive and negative, and this range may be relatively small. As an example, an initial group boundary 654 may be identified between layers 24 and 25, between initial super tiling groups [13:24] and [25:33], with a refinement range of N=1. The two determined variants of the initial group boundary then may be [13:23], [24:33] and [13:25], [26:33]. These determined variants may then be evaluated via performance simulations and associated with a cost factor. The variant with the relatively smallest cost factor may be selected as a final super tiling group configuration. In some cases, each group boundary of the initial group boundaries may be refined. In some cases, only group boundaries with a total volume change over or under a certain threshold size may be refined. In some cases, such as when two super tiling groups are within the refinement range of each other, the two super tiling groups may be merged. In some cases, different step sizes for the refinement range may be used, for example, adjusting the group boundary by two layers rather than one layer.
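A minimal sketch of this boundary refinement, in the same illustrative Python style, follows. The cost_of callable stands in for the performance simulation and is an assumption; only the variant generation mirrors the description above.

```python
# Minimal sketch of refining one group boundary, assuming half-open
# (start, end) groups as in the earlier sketches. `cost_of` stands in for the
# performance simulation (memory latency, processing speed, power) and is an
# assumed callable, not part of the described embodiments.

def boundary_variant(groups, idx, shift):
    """Shift the boundary shared by groups[idx] and groups[idx + 1]."""
    g = list(groups)
    start, end = g[idx]
    nstart, nend = g[idx + 1]
    g[idx] = (start, end + shift)
    g[idx + 1] = (nstart + shift, nend)
    return g

def refine_boundary(groups, idx, n, cost_of):
    """Evaluate variants within the refinement range +/- n; keep the cheapest."""
    variants = [boundary_variant(groups, idx, s) for s in range(-n, n + 1)]
    return min(variants, key=cost_of)
```

For the boundary 654 example above, refine_boundary with n=1 would evaluate the unmoved boundary along with the [13:23], [24:33] and [13:25], [26:33] variants and keep whichever simulates cheapest.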
In accordance with aspects of the present disclosure, a tile height and number of tiles may be configured for a super tiling group. In some cases, this determination may be based on back propagation from a tile height for the last layer of the super tiling group, such as layer 4 in the example shown in FIG. 5.
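The following sketch illustrates one way such back propagation might work, under the assumption (carried over from the earlier 3×3 example) that each layer adds a one-tile halo per pass; it reproduces the first-pass heights 3, 2, 1, 0 and last-pass heights 2, 3, 4, 5 of the four-layer example, but is not presented as the disclosed algorithm.

```python
# Assumed model: each 3x3 layer needs one extra upstream tile per pass, so the
# virtual (prewarming) pass precomputes more tiles in earlier layers, and the
# last pass finishes what the virtual pass already covered.

def first_pass_heights(num_layers):
    """Tile heights for the virtual/prewarming pass (pass 0)."""
    return [num_layers - 1 - i for i in range(num_layers)]

def last_pass_heights(num_layers, normal_height):
    """Tile heights for the last pass of the super tile run."""
    return [normal_height - h for h in first_pass_heights(num_layers)]

print(first_pass_heights(4))     # [3, 2, 1, 0]
print(last_pass_heights(4, 5))   # [2, 3, 4, 5]
```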
At block 718, if there are sets of super tile groups that have not been refined, at block 720, the CNN may be modeled to determine a cost factor for a super tile group boundary within a refinement range. For example, a CNN may be modeled by executing the CNN with simulated inputs and using a super tile grouping being modeled. The modeling may use simulated target hardware, such as by using a virtual machine, and record operational information, such as memory usage, latencies of the memories being used, processor usage, power consumption, etc. In some cases, each variant of a super tile group boundary within a refinement range may be simulated and a cost factor associated with the variant. At block 722, the variant with the lowest cost factor of the variants of the super tile group boundary within the refinement range may be selected as the super tile group boundary. At block 724, if there are additional super tile group boundaries to evaluate, execution returns to block 720 to evaluate those additional super tile group boundaries. If there are no more super tile group boundaries to evaluate, execution returns to block 718. If there are no additional sets of super tile groups to evaluate at block 718, then, if there are multiple sets of refined super tile groups, at block 726, cost factors across the multiple sets of refined super tile groups are compared to select a set of refined super tile groups with a lowest cost factor at block 728. Otherwise, the refined super tile groups are output at block 730.
As illustrated in FIG. 9, the computing device 900 includes a processor 905.
The processor 905 is operatively and communicatively coupled to on-chip memory 925, such as a cache memory, SRAM, registers, etc. With respect to cache memory, cache memory may include one or more L1 caches, one or more L2 caches, and one or more L3 caches. The L1 cache may be integrated in a package with the processor 905. The L2 and/or L3 caches may also be integrated in the processor package or may be in a package separate from the processor package. In certain cases, the L2 and/or L3 caches, or portions thereof, may be integrated with a memory controller, which helps manage memory traffic to the processor 905.
Persons of ordinary skill in the art are aware that software programs may be developed, encoded, and compiled in a variety of computing languages for a variety of software platforms and/or operating systems and subsequently loaded and executed by processor 905. In one example, the compiling process of the software program may transform program code written in a programming language to another computer language such that the processor 905 is able to execute the programming code. For example, the compiling process of the software program may generate an executable program that operates a ML network.
After the compiling process, the encoded instructions may then be loaded as computer executable instructions or process steps to processor 905 from storage 920, from memory 910, and/or embedded within processor 905 (e.g., via a cache or on-board ROM). Processor 905 may be configured to execute the stored instructions or process steps in order to perform instructions or process steps to transform the computing device into a non-generic, particular, specially programmed machine or apparatus. Stored data, e.g., data stored by a storage device 920, may be accessed by processor 905 during the execution of computer executable instructions or process steps to instruct one or more components within the computing device 900. Storage 920 may be partitioned or split into multiple sections that may be accessed by different software programs. For example, storage 920 may include a section designated for specific purposes, such as storing program instructions or data for updating software of the computing device 900. In one example, the software to be updated includes the ROM, or firmware, of the computing device. In certain cases, the computing device 900 may include multiple operating systems. For example, the computing device 900 may include a general-purpose operating system which is utilized for normal operations. The computing device 900 may also include another operating system, such as a bootloader, for performing specific tasks, such as upgrading and recovering the general-purpose operating system, and allowing access to the computing device 900 at a level generally not available through the general-purpose operating system. Both the general-purpose operating system and the other operating system may have access to the section of storage 920 designated for specific purposes.
The one or more communications interfaces may include a radio communications interface for interfacing with one or more radio communications devices. In certain cases, elements coupled to the processor may be included on hardware shared with the processor. For example, the communications interfaces 925, storage 920, and memory 910 may be included, along with other elements such as the digital radio, in a single chip or package, such as in a system on a chip (SOC). The computing device may also include input and/or output devices, not shown, examples of which include sensors, cameras, human input devices such as a mouse, keyboard, or touchscreen, monitors, display screens, tactile or motion generators, speakers, lights, etc.
In this description, the term “couple” may cover connections, communications, or signal paths that enable a functional relationship consistent with this description. For example, if device A generates a signal to control device B to perform an action: (a) in a first example, device A is coupled to device B by direct connection; or (b) in a second example, device A is coupled to device B through intervening component C if intervening component C does not alter the functional relationship between device A and device B, such that device B is controlled by device A via the control signal generated by device A.
Modifications are possible in the described embodiments, and other embodiments are possible, within the scope of the claims.