Machine learning (ML) is becoming an increasingly important part of the computing landscape. Machine learning is a branch of artificial intelligence (AI), and ML helps enable a software system to learn to recognize patterns from data without being directly programmed to do so. Neural networks (NN) are a type of ML which utilize a set of linked and layered functions (e.g., nodes, neurons, etc.) which are weighted to evaluate input data. In some NNs, sometimes referred to as convolutional NNs (CNNs), convolution operations are performed in NN layers based on inputs received and weights. CNNs are often used in a wide array of applications, typically for recognition and classification, such as image recognition and classification, prediction and recommendation systems, speech and language recognition and translation, etc. Layers in NNs may perform many types of functions, including, but not limited to, convolution, deconvolution, pooling, up-sampling, matrix multiplication, etc. Certain layers of a NN may also perform operations such as resizing the inputs for use by another layer. For example, a layer of a NN recognizer may resize a portion of an input image received by the layer (e.g., as intermediate data) for output to another layer for semantic segmentation of the portion of the input. Techniques for increasing performance of resizing operations may be useful.
This disclosure relates to a technique for resizing data via matrix multiplication including receiving input data values for resizing. The technique further includes placing a first number of data values from a first line of data values of the input data values in a first portion of a first vector. The technique also includes placing the first number of data values from a second line of data values of the input data values in a second portion of the first vector. The technique further includes receiving a first matrix of weights, wherein each weight of the first matrix of weights corresponds to an amount of weight to apply to a data value for a point on a first line of a set of resized data. The technique also includes multiplying the first vector and the first matrix of weights to determine data values for the first line of the set of resized data and outputting the set of resized data.
Another aspect of the present disclosure relates to an electronic device, the electronic device including one or more processors, wherein the one or more processors are configured to execute instructions causing the one or more processors to receive input data values for resizing. The instructions further cause the one or more processors to place a first number of data values from a first line of data values of the input data values in a first portion of a first vector. The instructions also cause the one or more processors to place the first number of data values from a second line of data values of the input data values in a second portion of the first vector. The instructions further cause the one or more processors to receive a first matrix of weights, wherein each weight of the first matrix of weights corresponds to an amount of weight to apply to a data value for a point on a first line of a set of resized data. The instructions also cause the one or more processors to multiply the first vector and the first matrix of weights to determine data values for the first line of the set of resized data and output the set of resized data.
Another aspect of the present disclosure relates to a non-transitory program storage device comprising instructions stored thereon to cause one or more processors to receive input data values for resizing. The instructions further cause the one or more processors to place a first number of data values from a first line of data values of the input data values in a first portion of a first vector. The instructions also cause the one or more processors to place the first number of data values from a second line of data values of the input data values in a second portion of the first vector. The instructions further cause the one or more processors to receive a first matrix of weights, wherein each weight of the first matrix of weights corresponds to an amount of weight to apply to a data value for a point on a first line of a set of resized data. The instructions also cause the one or more processors to multiply the first vector and the first matrix of weights to determine data values for the first line of the set of resized data and output the set of resized data.
For a detailed description of various examples, reference will now be made to the accompanying drawings in which:
As ML has become more common and powerful, hardware configured to execute ML models, such as embedded devices, has been introduced. As used herein, an ML model may refer to an implementation of one or more ML algorithms which model a behavior, such as object recognition, behavior of a circuit, behavior of a neuron, etc. In many cases, executing an ML model may involve resizing data. To help increase performance for resizing input data, some devices, such as a system on a chip (SoC), digital signal processor (DSP), embedded system, etc., may include a hardware resizer, which may be circuits on the device dedicated to resizing data. However, in some cases, the hardware resizer may not share certain memories, such as cache memories like a level 1 (L1) or level 2 (L2) cache, with a processor, such as a central processing unit (CPU), graphics processing unit (GPU), ML accelerator, etc. For example, an ML model may be executing on a CPU and/or ML accelerator of a device using an L2 cache. This L2 cache may be a relatively high speed, on-chip memory, such as static random access memory (SRAM). The hardware resizer may not be able to directly access this cache memory. In cases where a resize layer of the ML model is sandwiched between other layers, data for input to the resize layer may need to be written from the cache memory to another memory prior to executing the resize layer on the hardware resizer. This other memory may be relatively slower than lower level cache memories, such as the L2 cache memory or L1 cache memory. Loading data from the ML model into another memory (and subsequently reloading the output of the resize layer into the L2 cache) can slow down the execution of the resize layer specifically and the ML model as a whole. To help increase ML execution performance, a technique to perform data resizing within a matrix multiplication accelerator (MMA) may be used.
The processing cores, such as CPU cores 102 and MMAs 104, may be able to access one or more common caches shared between the processing cores. In this example, the processing cores may access an L1 cache 106, an L2 cache 108, and an L3 cache 110. In some cases, the processing cores and one or more caches, here the L1 cache 106 and the L2 cache 108, may be integrated on a single chip, such as a SoC 112. The L3 cache 110 and an external memory 114 may be on one or more chips separate from the SoC 112. The external memory 114 may be any kind of memory storage, including double data rate (DDR) random access memory, other types of random access memory, flash memory, hard disk, etc.
In many cases, access to the memories and other components on the SoC 112, such as between the CPU cores 102, MMAs 104, L1 cache 106, and L2 cache 108, may be substantially faster than accessing components separate from the SoC 112, such as the L3 cache 110 and external memory 114. As such, it can be more efficient to do as much processing as possible when executing the ML model on components within the SoC 112.
Often when resizing data, additional data is added to enlarge the data from the original size to the new size. In many cases, this additional data is not simply a copy of the existing data. Rather, this additional data may be interpolated from the existing data. Interpolation estimates new data values based on existing data values using one or more functions. One technique that may be used when interpolating one-dimensional data is linear interpolation based on the existing data. Linear interpolation applies, in essence, a weighted average of the existing data points to estimate the additional data. Linear interpolation may be extended into multiple dimensions, conceptually as bilinear, trilinear, etc. interpolation. Data resizing via interpolation may be applied to any type of data including, but not limited to, image data, sensor data, sets of numbers, etc.
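As an illustration, a minimal sketch of linear interpolation as a weighted average is shown below; the data values and the 2× enlargement are assumptions chosen for illustration only.

```python
# Minimal sketch of 1D linear interpolation as a weighted average.
# The data values and the 2x enlargement are illustrative assumptions.
def lerp(p0, p1, t):
    # Weight each existing point by its proximity to the new point (t in [0, 1]).
    return (1.0 - t) * p0 + t * p1

existing = [10.0, 20.0, 40.0]
resized = []
for i in range(len(existing) - 1):
    resized.append(existing[i])
    # New data point estimated halfway between the two existing neighbors.
    resized.append(lerp(existing[i], existing[i + 1], 0.5))
resized.append(existing[-1])
print(resized)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```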
In certain cases, weights may be predetermined based on distance for each of the four nearest input data points. For example, a size of the input data may be known, as well as distances between the input data points, and how much resizing to perform (e.g., 2×, 4×, 6×, etc.). From this, weights may be predetermined for each of the generated data points to be determined. The predetermined weights may then be stored and loaded along with the ML model. Where multiple resizings are performed, different sets of predetermined weights may be stored for the resizings. For example, a 2× resizing may not share the predetermined weights used for a 4× resizing. In some cases, a set of predetermined weights may be shared across more than one resize operation. For example, multiple 2× resizings may use the same set of predetermined weights. In some cases, weighting based on distances from the neighbors may be referred to as nearest neighbor interpolation. It should be understood that while the example provided applies nearest neighbor interpolation, the specific technique for determining and applying weights to the input data points does not matter so long as the weights are deterministic for each generated data point.
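The following sketch illustrates how weights might be predetermined from the amount of resizing. It assumes a half-pixel (pixel-center) sampling convention, which is an assumption of this sketch rather than something fixed by the disclosure, and it covers interior rows only (edge rows would need clamping).

```python
# Hedged sketch: predetermining interpolation weights from the resize factor.
# A half-pixel (pixel-center) convention is assumed; interior rows only.
def phase_weights(scale):
    """For each output-row phase, return the weights applied to the two
    surrounding input rows (lower-index row first)."""
    table = []
    for phase in range(scale):
        # Fractional position of the output row within the gap between input rows.
        frac = ((phase + 0.5) / scale - 0.5) % 1.0
        table.append((1.0 - frac, frac))
    return table

print(phase_weights(2))  # [(0.25, 0.75), (0.75, 0.25)] -- weight set for 2x
print(phase_weights(4))  # a different set for 4x, stored separately per the text
```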
In this example, the nearest input data points are weighted with integers such that the closest input data point (of the nearest input data points) is weighted with a value of 9, the next two closest input data points (which are equally distant) are weighted with a value of 3, and the furthest input data point (of the nearest input data points) is weighted with a value of 1. The nearest input data points for generated data points 210, 212, 214, and 216 are input data points 202, 204, 206, and 208. For generated data point 210, the closest input data point is input data point 202, and the value of input data point 202 may be weighted with a value of 9. Values of input data point 204 and input data point 206 are weighted with a value of 3, and the value of input data point 208 is weighted with a value of 1. Thus, the value of generated data point 210 is (9a+3b+3d+e)/16. Similarly, the value of generated data point 212 is (9b+3a+3e+d)/16, the value of generated data point 214 is (9d+3a+3e+b)/16, and the value of generated data point 216 is (9e+3b+3d+a)/16.
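The arithmetic above can be checked with a short sketch; the values assigned to a, b, d, and e are hypothetical.

```python
# Sketch of the 9/3/3/1 weighting for the four generated data points.
# The values assigned to a, b, d, and e are assumptions for illustration.
a, b, d, e = 4.0, 8.0, 12.0, 16.0

g210 = (9 * a + 3 * b + 3 * d + e) / 16  # closest to input data point 202 (a)
g212 = (9 * b + 3 * a + 3 * e + d) / 16  # closest to input data point 204 (b)
g214 = (9 * d + 3 * a + 3 * e + b) / 16  # closest to input data point 206 (d)
g216 = (9 * e + 3 * b + 3 * d + a) / 16  # closest to input data point 208 (e)

print(g210, g212, g214, g216)  # 7.0 9.0 11.0 13.0
```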
Calculating the values for the generated data on a conventional CPU for a 2× resize operation on four input data points then requires 16 multiplication operations (four for each generated data point) as well as 16 register-level data movements and four round-and-shift operations (one for each generated data point, for the division operation). To help reduce the number of operations needed to generate the generated data points, it may be beneficial to use the MMAs 104 to parallelize multiple multiplication operations for execution together.
Values in the vector 320 may be multiplied against values of the matrix 340 in a specific way as a part of matrix multiplication by the MMA, based on matrix multiplication principles. In this example, values for the generated data points in generated row 1 of diagram 300 may be determined by the MMA by multiplying a value of cell 322A with a weight in col. 1, row 1 of matrix 340 (i.e., 9a); multiplying a value of cell 322B with a weight in col. 1, row 2 (i.e., 3a); multiplying a value of cell 322C with a weight in col. 1, row 3 (i.e., 0b); and so forth until the values of vector 320 are multiplied with respective values in col. 1 of matrix 340. These multiplied values may then be summed to determine the value of the generated data point in generated row 1, generated col. 1 of diagram 300. Of note, as values for the second line of input data points were concatenated with input data points of the first line, weighted values for all four nearest input data points are determined in a single matrix multiplication pass of the MMA for a generated data point.
Similarly, a value for the generated data point in generated row 1, generated col. 2, of diagram 300, may be determined by the MMA by multiplying a value of cell 322A with a weight in col. 2, row 1 of matrix 340, then multiplying a value of cell 322B with a weight in col. 2, row 2, then multiplying a value of cell 322C with a weight in col. 2, row 3, and so forth until the values of vector 320 are multiplied with respective values in col. 2 of matrix 340. These multiplied values may then be summed to determine the value of the generated data point in generated row 1, generated col. 2, of diagram 300. This process continues by multiplying each value in vector 320 with values in a corresponding column of matrix 340. The MMA may perform two or more of the multiplications of the vector values and the matrix values in parallel. The results of the multiplication of vector 320 and matrix 340 may be output as a vector which includes the values for the generated data points in generated row 1 of diagram 300.
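A minimal sketch of this vector-times-matrix formulation is shown below, reusing the 2×2 input and 9/3/3/1 weights from the earlier example. The layouts of the weight matrices are assumptions following the described pattern (one column per generated data point); matrices 340 and 360 themselves are not reproduced here.

```python
import numpy as np

# Sketch of resizing via matrix multiplication using the 9/3/3/1 example.
# The input values a, b, d, e and the matrix layouts are assumptions.
a, b, d, e = 4.0, 8.0, 12.0, 16.0

# Concatenate the first line [a, b] and the second line [d, e] into one vector
# (analogous to vector 320).
v = np.array([a, b, d, e])

# Each column holds the weights for one generated data point on generated row 1
# (analogous to matrix 340): column 1 -> point 210, column 2 -> point 212.
W_row1 = np.array([[9, 3],
                   [3, 9],
                   [3, 1],
                   [1, 3]]) / 16.0

# Each column holds the weights for one generated data point on generated row 2
# (analogous to matrix 360): column 1 -> point 214, column 2 -> point 216.
W_row2 = np.array([[3, 1],
                   [1, 3],
                   [9, 3],
                   [3, 9]]) / 16.0

gen_row1 = v @ W_row1  # [7.0, 9.0]   -> values for points 210 and 212
gen_row2 = v @ W_row2  # [11.0, 13.0] -> values for points 214 and 216
print(gen_row1, gen_row2)
```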
Similarly, to determine values for the generated data points in generated row 2 of diagram 300, the vector 320 may be multiplied against matrix 360.
In this example, to determine values for generated rows 3 and 4, input data from input rows 2 and 3 are used. Similar to vector 320, vector 380 shows values that may be input to the MMA. More specifically, here, vector 380 includes values of input data points corresponding to input row 2 of diagram 300 concatenated with input data points corresponding to input row 3 of diagram 300.
In some cases, a set of vectors (e.g., as a matrix) of values of input data points may be prepared prior to beginning the matrix multiplication of vectors of the set of vectors against the matrices. For example, a matrix including vector 320 and vector 380 may be prepared and then vector 320 may be multiplied against matrix 340 and matrix 360 and vector 380 multiplied against matrix 340 and matrix 360.
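Building on the sketch above, a set of input vectors may be stacked as a matrix so that one multiplication per weight matrix produces several generated rows at once; the values below, including an assumed third input row, are illustrative.

```python
import numpy as np

# Sketch of preparing a set of input vectors (as a matrix) before multiplying.
# V stacks a vector like 320 (input rows 1 and 2) and a vector like 380
# (input rows 2 and 3); the values and the 4x2 weight matrices are assumptions.
V = np.array([[4.0, 8.0, 12.0, 16.0],     # vector 320: row 1 ++ row 2
              [12.0, 16.0, 20.0, 24.0]])  # vector 380: row 2 ++ row 3 (assumed)

W_odd = np.array([[9, 3], [3, 9], [3, 1], [1, 3]]) / 16.0   # like matrix 340
W_even = np.array([[3, 1], [1, 3], [9, 3], [3, 9]]) / 16.0  # like matrix 360

# One multiplication per weight matrix now yields two generated rows:
# the first output row comes from vector 320, the second from vector 380.
print(V @ W_odd)   # generated rows 1 and 3
print(V @ W_even)  # generated rows 2 and 4
```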
In some cases, a size (e.g., width and/or height) of the input data points exceeds the size of the vectors. For example, an MMA may be able to support multiplying a vector of a width K with a matrix with a width×depth of L×L, but the input data points may have width (e.g., a column size) M where M>L. In such cases, the input data points may be broken down into smaller sets of vectors for use with the MMA.
A second vector, such as vector 420 of
In some cases, loading input data values into the vectors may be performed by the MMA. It may be understood that while the loading of the input data values may be performed by input/output components around (e.g., coupled to) the core MMA hardware accelerator components, these input/output components may be considered part of the MMA. When performed by the MMA, placing less than K input data values into a vector at a time may be relatively inefficient, as multiple external memory reads may be used per vector. In some cases, the MMA may be able to create multiple vectors.
Once the values have been copied, the second portion of the third vector 510 may be swapped for the first portion of the fourth vector 512 to generate a fifth vector 514. Similarly, the first portion of the fourth vector 512 may be swapped for the second portion of the third vector 510 to generate a sixth vector 516. In some cases, the fifth vector 514 and the sixth vector 516 may represent modified versions of the first vector 502 and second vector 504, respectively. The fifth vector 514 and the sixth vector 516 may be multiplied with a corresponding matrix of weights, such as matrix 340 or 360. As modifying the vectors may occur within the MMA without additional memory requests to memories outside of the MMA, modifying a vector after loading data into the vector may be performed faster than loading the specific values from memory into the vector.
Additional vectors may be loaded from the first and second lines until the ends of the lines of input grid 400 are reached. Of note, a value in the K position of a vector may be dropped (e.g., in third vector 510 and fourth vector 512) as compared to the originally loaded values of the first vector 502 and second vector 504, and an overlap of values from the previous vector may be applied. Therefore, the next vector, such as a seventh vector 518, may be loaded starting from the K-2 input data value of the respective rows. For example, the seventh vector 518 may load values from the first line starting at the K-2 input data value of the first vector 502, here the value g. Similarly, an eighth vector 520 may load values from the second line starting at the K-2 input data value of the second vector 504, here the value y.
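A hedged sketch of this load-and-swap scheme follows; the line values, the vector width K, and the exact overlap indexing are assumptions chosen for illustration.

```python
import numpy as np

# Hedged sketch of the load-and-swap scheme: K values are loaded from each of
# two lines, the halves are swapped so each resulting vector holds data from
# both lines, and successive loads overlap the previous load. K, the line
# values, and the step of K - 2 are illustrative assumptions.
K = 8
line1 = np.arange(100.0, 120.0)   # stand-in for the first line of input data
line2 = np.arange(200.0, 220.0)   # stand-in for the second line of input data

def load_pair(start):
    va = line1[start:start + K]                      # like third vector 510
    vb = line2[start:start + K]                      # like fourth vector 512
    half = K // 2
    v_lo = np.concatenate([va[:half], vb[:half]])    # like fifth vector 514
    v_hi = np.concatenate([va[half:], vb[half:]])    # like sixth vector 516
    return v_lo, v_hi

# Each subsequent load begins K - 2 values after the previous one, so adjacent
# vectors overlap; each v_lo / v_hi would be multiplied against a weight matrix.
for start in range(0, len(line1) - K + 1, K - 2):
    v_lo, v_hi = load_pair(start)
    print(start, v_lo, v_hi)
```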
In some cases, the left and right pads 608 and 610 may be dynamically determined. For example, pad values corresponding to the left and right pads 608 and 610 may be added to vectors as needed after loading values of input data points into the vector. The pad values may be added along with copying the overlap data values and swapping a second portion of a first vector with a first portion of a second vector, as described above.
In some cases, a determination of whether pad values may be added to a vector may be based on a permute pattern. In some cases, the permute pattern may be predetermined. For example, permute patterns may be determined based on a width of the input data points. These permute patterns may be stored and loaded along with the ML model. In some cases, the permute patterns may be included with the ML model.
As an example, values of input data points may be read into vector 650.
For cell 5 of vector 650, the index number 632 from corresponding place 5 of the permute pattern 630 is 4. Thus, the value of cell 5 of the vector 650 is set to the value of cell 4 of the vector 650 (i.e., d) based on the index number 632 from place 5 of the permute pattern 630, as shown in vector 652. By using the value of a first cell specified in the index number of the permute pattern 630 for determining a value for a second cell, the permute pattern 630 may be used to apply padding as well as overlapping for the vectors. In some cases, the permute pattern 630 may be predetermined. For example, the permute pattern may be determined based on expected input grid 602 dimensions and/or vector/matrix dimensions supported by the MMA. The permute pattern may be determined, for example, as a part of development of a ML model and stored, for example, along with the ML model. In some cases, the permute pattern may have a same width as the input grid 602, here M+2. In some cases, a single permute pattern may be reused for each row of the input grid. After the permute pattern 630 is applied to obtain the padded and/or overlapped vector 652, the vector 652 may be multiplied against a matrix of weights.
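A minimal sketch of applying a permute pattern as a gather operation is shown below; the pattern and input values are assumptions, not the pattern 630 from the figures.

```python
import numpy as np

# Hedged sketch of applying a permute pattern as a gather: each output cell
# takes the value of the input cell named by the corresponding index number.
# The pattern and the input values below are assumptions for illustration.
raw = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # values read from a row (like vector 650)

# Repeating index 0 at the start and index 5 at the end acts as left/right
# padding (edge replication); repeating an interior index creates overlap.
permute_pattern = np.array([0, 0, 1, 2, 3, 4, 5, 5])

padded = raw[permute_pattern]   # like vector 652, ready to multiply against weights
print(padded)  # [0. 0. 1. 2. 3. 4. 5. 5.]
```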
In some cases, a set of predetermined matrices of weights may be used. For example, as discussed above, the weights for each generated data point may be predetermined based on the amount of resizing to perform and stored and loaded along with the ML model.
It may be observed that weights to be applied to input data points in generated row 3 704C and generated row 4 704D are an inverse of weights to be applied to input data points in generated row 2 704B and generated row 1 704A, respectively. In some cases, such as when performing a 4× resizing, determining values for generated row 1 704A and generated row 2 704B may be performed in a manner substantially similar to that described above.
The MMA may then be configured to invert the order in which the input rows are placed into the vectors to determine values for generated row 3 704C and generated row 4 704D. To invert the order in which input rows are placed into the vectors, an indication may be provided to the MMA. In some cases, a flag may be set to apply the inverted order.
The MMA may then place values of input data points from input row 2 702B into a third vector 726 (with padding and overlapping). Similarly, values of input data points from input row 1 702A may be placed into a fourth vector 728 (with padding and overlapping). A second portion of the third vector 726 may be swapped for the first portion of the fourth vector 728 to generate a fifth vector 730. The fifth vector 730 may then be multiplied against the first matrix of weights (not shown) to determine values for generated row 3 704C. This first matrix of weights is the same matrix of weights used for determining values for generated row 1 704A. Similarly, the first portion of the fourth vector 728 may be swapped for the second portion of the third vector 726 to generate a sixth vector 732. The sixth vector 732 may then be multiplied against the second matrix of weights (not shown) to determine values for generated row 4 704D. This second matrix of weights is the same matrix of weights used for determining values for generated row 2 704B.
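The sketch below illustrates the reuse principle under an assumed set of 4× vertical weights (7/8, 5/8, 3/8, and 1/8 on the nearer input row); the actual weights and matrix-to-row pairing of the figures are not reproduced here, and horizontal interpolation is omitted so the matrices stay small.

```python
import numpy as np

# Hedged sketch: reusing the same weight matrices by inverting the order in
# which the input rows are placed into the vector. The 4x vertical weights
# (7/8, 5/8, 3/8, 1/8 on the nearer input row) are assumptions, and horizontal
# interpolation is omitted for brevity.
r1 = np.array([10.0, 20.0])   # values from input row 1
r2 = np.array([30.0, 40.0])   # values from input row 2

def w_matrix(w_first, w_second):
    # Applies w_first to the first portion of the vector and w_second to the
    # second portion (one column per generated data point).
    return np.vstack([w_first * np.eye(2), w_second * np.eye(2)])

W_a = w_matrix(7 / 8, 1 / 8)
W_b = w_matrix(5 / 8, 3 / 8)

v_fwd = np.concatenate([r1, r2])   # rows placed in the original order
v_inv = np.concatenate([r2, r1])   # rows placed in the inverted order

# The forward vector yields the two generated rows nearest input row 1; the
# inverted vector reuses W_a and W_b to yield the two rows nearest input row 2,
# so no additional weight matrices are required.
print(v_fwd @ W_a, v_fwd @ W_b)
print(v_inv @ W_a, v_inv @ W_b)
```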
In some cases, the MMA may output results from the matrix multiplication sequentially for multiple generated rows. The MMA may accept certain sized inputs, such as a vector having a width of K and a matrix having dimensions of L×L. Sometimes, a number of columns of input data points in an input row does not divide evenly by the width of vectors processed by the MMA. In such cases, for a last vector of input data points for a row, not all cells of the vector may be used, and the remaining cells of the vector may be filled with unneeded data, such as zeros. When outputting values of a matrix multiplication using the last vector for the generated rows, a masked write may be used to avoid overwriting other data based on the unneeded data.
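A minimal sketch of such a masked write is shown below; the vector width, the number of valid lanes, and the values are assumptions for illustration.

```python
import numpy as np

# Hedged sketch of a masked write: only lanes holding valid results are written
# to the output buffer, while lanes filled with unneeded zeros in the last
# partial vector are masked off. Sizes and values are illustrative assumptions.
K = 8
valid = 5                                              # only 5 of K lanes hold real results
result = np.array([1., 2., 3., 4., 5., 0., 0., 0.])    # output of the last-vector multiply
out = np.full(K, -1.0)                                 # destination holding other data

mask = np.arange(K) < valid                # True for lanes that should be written
np.copyto(out, result, where=mask)         # masked write leaves the rest untouched
print(out)  # [ 1.  2.  3.  4.  5. -1. -1. -1.]
```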
At block 808, a first matrix of weights is received, wherein each weight of the first matrix of weights corresponds to an amount of weight to apply to a data value for a point on a first line of a set of resized data. For example, resizing input data includes creating new generated data based on the input data. A generated data point of the generated data may be based on a weighted average of values of the input data at a number of points closest to the generated data point. The matrix of weights may describe weights to apply to the closest input data points for a line of generated data points. At block 810, the first vector is multiplied with the first matrix of weights to determine data values for the first line of the set of resized data. For example, the MMA may perform a matrix multiplication between the vector of input data values and the matrix of weights. At block 812 the set of resized data is output.
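Tying the blocks together, the following hedged sketch mirrors the described flow: a first line of input data is placed in the first portion of a vector, a second line in the second portion, and the vector is multiplied with a matrix of weights to determine one line of resized data. The function name, values, and weights are illustrative assumptions.

```python
import numpy as np

# Hedged end-to-end sketch mirroring blocks 808-812 together with the vector
# preparation described earlier. The function name, values, and weight matrix
# are assumptions, not the disclosure's implementation.
def resize_line_pair(line1, line2, weights):
    v = np.concatenate([line1, line2])   # first vector (first and second portions)
    return v @ weights                   # data values for one line of resized data

line1 = np.array([4.0, 8.0])
line2 = np.array([12.0, 16.0])
# Matrix of weights: one column per point on the first line of the resized data.
W = np.array([[9, 3], [3, 9], [3, 1], [1, 3]]) / 16.0
print(resize_line_pair(line1, line2, W))   # [7. 9.]
```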
In this description, the term “couple” may cover connections, communications, or signal paths that enable a functional relationship consistent with this description. For example, if device A generates a signal to control device B to perform an action: (a) in a first example, device A is coupled to device B by direct connection; or (b) in a second example, device A is coupled to device B through intervening component C if intervening component C does not alter the functional relationship between device A and device B, such that device B is controlled by device A via the control signal generated by device A.
A device that is “configured to” perform a task or function may be configured (e.g., programmed and/or hardwired) at a time of manufacturing by a manufacturer to perform the function and/or may be configurable (or re-configurable) by a user after manufacturing to perform the function and/or other additional or alternative functions. The configuring may be through firmware and/or software programming of the device, through a construction and/or layout of hardware components and interconnections of the device, or a combination thereof.
Unless otherwise stated, “about,” “approximately,” or “substantially” preceding a value means +/−10 percent of the stated value. Modifications are possible in the described examples, and other examples are possible within the scope of the claims.
This application claims priority to U.S. Provisional Application No. 63/173,581, filed Apr. 12, 2021, which is hereby incorporated by reference.