The present disclosure relates to the field of computers, and more specifically, the present disclosure relates to operator fusion of a neural network.
Image supersampling refers to restoring a low-resolution image to a high-resolution image, which is a very important technology in the fields of computer vision and image processing. Image supersampling is widely used in fields such as medical imaging, surveillance, and security. So far, a large number of classical supersampling methods have been proposed in academic and practical applications, including prediction-based methods, edge-based methods, statistical methods, block-based methods, and sparse representation methods.
With the rapid development of deep learning technology in recent years, deep learning-based supersampling technology has achieved very good results. Different deep learning methods have been applied to image supersampling, such as convolutional neural networks and generative adversarial networks. The convolution operation is a foundation of deep learning neural networks, and while deep learning supersampling networks differ greatly from each other, they are essentially composed of operation components that involve a large amount of convolution computing. The output of a supersampling deep learning network is usually a high-resolution image with a large amount of data, which makes the storage space and amount of computation required during operation very large.
Due to the computationally intensive nature of deep learning neural networks, a graphics processing unit (GPU) with multiple levels of on-chip storage or a deep learning-dedicated processor is currently commonly used for acceleration, including for deep learning-based supersampling networks. However, the inputs, outputs, and intermediate computing results of a deep learning-based supersampling network are very large and generally cannot reside in the multilevel high-speed on-chip storage of a processor, so the utilization rate of the multilevel high-speed on-chip storage is low, and the acceleration effect is not obvious.
One purpose of the present disclosure is to solve low utilization of a multilevel high-speed on-chip storage and low acceleration effect during the neural network operation process in the prior art.
A first aspect of the present disclosure provides a method for fusing operators of a neural network, including: constructing a directed computing graph of the neural network, where the directed computing graph includes a plurality of nodes connected by directed edges; traversing the plurality of nodes to determine whether a traversed node satisfies a preset condition; and determining an operator corresponding to the node that satisfies the preset condition as a to-be-fused operator to perform fusion to generate a fusion operator.
A second aspect of the present disclosure provides an electronic device, including: one or a plurality of processors; and a memory, on which a computer-executable instruction is stored, where when the computer-executable instruction is run by the one or the plurality of processors, the electronic device performs the above method.
A third aspect of the present disclosure provides a computer-readable storage medium, including a computer-executable instruction, where when the computer-executable instruction is run by one or a plurality of processors, the method described above is performed.
At least one beneficial effect of the present disclosure is that operators that satisfy a condition in a neural network may be fused, thus saving the overall running time of the neural network.
By reading the following detailed description with reference to drawings, the above-mentioned and other objects, features and technical effects of exemplary implementations of the present disclosure will become easier to understand. In the drawings, several implementations of the present disclosure are shown in an exemplary manner rather than a restrictive manner, and the same or corresponding reference numerals indicate the same or corresponding parts.
Technical solutions in embodiments of the present disclosure will be described clearly and completely hereinafter with reference to drawings in the embodiments of the present disclosure. Obviously, embodiments to be described are merely some rather than all embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative efforts shall fall within the scope of protection of the present disclosure.
It should be understood that terms such as “first”, “second”, “third”, and “fourth” that appear in the claims, the specification, and the drawings are used for distinguishing different objects rather than describing a specific order. It should be understood that terms “including” and “comprising” used in the specification and the claims indicate the presence of a feature, an entity, a step, an operation, an element, and/or a component, but do not exclude the existence or addition of one or more of other features, entities, steps, operations, elements, components, and/or collections thereof.
It should also be understood that terms used in the specification of the present disclosure are merely intended to describe a specific embodiment rather than to limit the present disclosure. As being used in the specification and the claims of the present disclosure, unless the context clearly indicates otherwise, singular forms such as “a”, “an”, and “the” are intended to include plural forms. It should also be understood that a term “and/or” used in the specification and the claims refers to any and all possible combinations of one or more of relevant listed items and includes these combinations.
As being used in the specification and the claims of the present disclosure, a term “if” may be interpreted as “when”, or “once” or “in response to a determination” or “in response to a case where something is detected” depending on the context. Similarly, depending on the context, a clause “if it is determined that” or “if [a described condition or event] is detected” may be interpreted as “once it is determined that”, or “in response to a determination”, or “once [a described condition or event] is detected”, or “in response to a case where [a described condition or event] is detected”.
As shown in
The directed computing graph in
After the directed computing graph is constructed, each node in the computing graph may be traversed to determine characteristics of the node itself and a relationship between the node and other nodes, so as to determine whether a traversed node satisfies a preset condition for fusion. If the traversed node satisfies the preset condition, an operator corresponding to the node may be fused. It is required to be understood that in the present disclosure, the terms "operator" and "node" refer to essentially the same thing: a node is just a symbolic representation of an operator, so the two may be used interchangeably.
According to an implementation of the present disclosure, in order to facilitate a fusion of operators, an optimization search sequence is created according to a plurality of nodes; and the plurality of nodes in the optimization search sequence are traversed to determine whether a traversed node satisfies a preset condition.
In judging whether an operator may be fused, nodes in
According to an implementation of the present disclosure, the directed computing graph may be traversed backwards to create the optimization search sequence. Creating the optimization search sequence according to the plurality of nodes includes: determining output edge counts in the directed edges of the nodes; and determining a node with an output edge count of 0 as a candidate node and adding the node to a first queue to create the optimization search sequence. The backward traversal of the directed computing graph means starting from a final output node of the computing graph and traversing each node from downstream to upstream.
Therefore, in creating the optimization search sequence, the node with an output edge count of 0 may first be used as an initial node to be added to the optimization search sequence, and the nodes in the collection E0 that may be fused with the initial node are then searched for step by step.
As shown in
It is assumed that the optimization search sequence is represented by a sequence Qi, where i is a serial number. As described above, the sequence Qi is initially empty and represented as Q0. Since an output edge count of a node Add2 is 0, the node Add2 may be first added to the sequence Q0, thus forming a sequence Q1={Add2}.
Next, the node Add2 may be deleted from the collection E0, so the collection E0 is updated as E1={conv4, relu2, conv3, add1, conv2, relu1, conv1, conv_first}. Output edge counts of input nodes of the initial node Add2 are subtracted by 1. The input nodes of the node Add2 are a node Add1 and a node Conv4, so output edge counts of these two nodes Add1 and Conv4 are subtracted by 1, and the output edge counts of the nodes Add1 and Conv4 are updated to 1 and 0, respectively. As such, output edge counts of all nodes in the collection E1 are {0,1,1,1,1,1,1,2}, respectively.
At this time, an output edge count of the node Conv4 is 0, so the node Conv4 may be added to the sequence Q1, thus forming a sequence Q2={Add2, Conv4}.
Next, the node Conv4 may be deleted from the collection E1, so the collection E1 is updated as E2={relu2, conv3, add1, conv2, relu1, conv1, conv_first}. An output edge count of an input node of the node Conv4 is subtracted by 1. The input node of the node Conv4 is a node Relu2, so an output edge count of the node Relu2 is subtracted by 1. As such, output edge counts of all nodes in the collection E2 are {0, 1, 1, 1, 1, 1, 2}, respectively.
At this time, an output edge count of the node relu2 is 0, so the node relu2 may be added to the sequence Q2, thus forming a sequence Q3={Add2, Conv4, Relu2}.
Next, the node Relu2 may be deleted from the collection E2, so the collection E2 is updated as E3={conv3, add1, conv2, relu1, conv1, conv_first}. An output edge count of an input node of the node Relu2 is subtracted by 1. The input node of the node Relu2 is a node Conv3, so an output edge count of the node Conv3 is subtracted by 1. As such, output edge counts of all nodes in the collection E3 are {0, 1, 1, 1, 1, 2}, respectively.
At this time, an output edge count of the node Conv3 is 0, so the node Conv3 may be added to the sequence Q3, thus forming a sequence Q4={Add2, Conv4, Relu2, Conv3}.
In this way, all nodes in the directed computing graph may be added to the sequence Qi to form an optimization search sequence. It may be understood that the final optimization search sequence is Q9={add2, conv4, relu2, conv3, add1, conv2, relu1, conv1, conv_first}.
The above is only an illustrative explanation in combination with
The directed computing graph shown in
For example, a node Add2 is first pushed onto the stack, followed by a node Conv4, a node Relu2, a node Conv3, and so on. Thus, when an output edge count of a certain node is 0, or when an output edge count of a certain node is updated to 0, the node is popped from the stack. It is required to be understood that creating a stack is just one convenient way; it is not strictly necessary to create a stack.
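As a non-limiting illustration (not the disclosed implementation itself), the sequence-creation procedure described above may be sketched in Python as follows; the graph topology below, with each node mapped to its input nodes, is assumed from the example.

```python
from collections import deque

def build_search_sequence(inputs_of):
    """Backward traversal: repeatedly take a node whose output edge count is 0,
    append it to the optimization search sequence, and decrement the output
    edge counts of its input (parent) nodes."""
    out_count = {n: 0 for n in inputs_of}          # output edge count of each node
    for node, parents in inputs_of.items():
        for p in parents:
            out_count[p] += 1

    ready = deque(n for n, c in out_count.items() if c == 0)   # e.g. the node add2
    sequence = []
    while ready:
        node = ready.popleft()
        sequence.append(node)
        for p in inputs_of[node]:
            out_count[p] -= 1
            if out_count[p] == 0:
                ready.append(p)
    return sequence

# The example graph above, with each node mapped to its input nodes
# (the exact topology is assumed here for illustration):
graph = {
    "conv_first": [],
    "conv1": ["conv_first"],
    "relu1": ["conv1"],
    "conv2": ["relu1"],
    "add1": ["conv2", "conv_first"],
    "conv3": ["add1"],
    "relu2": ["conv3"],
    "conv4": ["relu2"],
    "add2": ["conv4", "add1"],
}
print(build_search_sequence(graph))
# ['add2', 'conv4', 'relu2', 'conv3', 'add1', 'conv2', 'relu1', 'conv1', 'conv_first']
```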
After all nodes are set to the optimization search sequence in order according to the above method, a next step is to determine which nodes in the optimization search sequence may be fused.
As shown in
First, an empty second queue may be created. Then, a first node (such as a node Add2) in an optimization search sequence may be added to the second queue as a first to-be-fused node.
Next, starting from a second node of the optimization search sequence (the first queue), all nodes are traversed, and whether the traversed node satisfies the preset condition is determined. If the traversed node satisfies the preset condition, the traversed node is added to the second queue, so that the node may be fused with the first node in the second queue. However, if the traversed node does not satisfy the preset condition, then the traversed node is unable to be fused with the node in the second queue.
It is required to be understood that creating a second queue is only a more intuitive way, and it is not necessary to create a second queue to accommodate nodes that may be fused.
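As a hedged sketch only (the judgment function is a placeholder, since several different preset conditions are described below), the second-queue procedure may be written as:

```python
def collect_fusible_nodes(sequence, satisfies_preset_condition):
    """Seed the second queue with the first node of the optimization search
    sequence, then traverse the remaining nodes and append every node that
    satisfies the preset condition with respect to the queue."""
    second_queue = [sequence[0]]                 # e.g. the node add2
    for node in sequence[1:]:                    # start from the second node
        if satisfies_preset_condition(node, second_queue):
            second_queue.append(node)            # may be fused with the queue
        # otherwise the node is simply not added; how traversal proceeds after a
        # non-fusible node (stop or skip) depends on the concrete implementation
    return second_queue
```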
Next, multiple scenarios of the above preset condition will be described in detail.
According to an implementation of the present disclosure, if a first storage space occupied by the traversed candidate node is less than or equal to a second storage space occupied by a child candidate node of the candidate node, the traversed candidate node satisfies the preset condition.
An optimization search sequence Q9={add2, conv4, relu2, conv3, add1, conv2, relu1, conv1, conv_first} is still taken as an example for explanation.
First, a node add2 is added to a second queue. Next, starting from a node conv4, all other nodes are traversed. The node add2 is an adding node, and a size of output data of the node is usually not larger than a size of input data of the node.
Then, the node conv4 is traversed. The node conv4 is a convolution operation node. According to characteristics of a convolution operation, a size of output data of the convolution operation node conv4 is usually larger than a size of input data of the convolution operation node conv4. Therefore, a size of a storage space required to be occupied by the node conv4 should not be smaller than a size of a storage space occupied by the output data of the node conv4. It may be contemplated that in this situation, the size of the storage space occupied by the node conv4 is not larger than (is smaller than or equal to) a size of a storage space occupied by a child node add2 of the node conv4. Therefore, the node conv4 may be added to the second queue to be fused with the node add2.
Then, a node relu2 is traversed. The node relu2 is an in place operation node. The in place operation means that an operation takes place at an original storage location of data involved in the operation. Such an operation may not cause data to swell or shrink. Therefore, a size of a storage space occupied by the node relu2 is not larger than (is smaller than or equal to) a size of a storage space occupied by a child node conv4 of the node relu2.
Therefore, the node relu2 may be added to the second queue to be fused with the nodes add2 and conv4.
Next, a node conv3 is traversed. The node conv3 is a convolution operation node. According to characteristics of a convolution operation, a size of output data of the convolution operation node conv3 is usually larger than a size of input data of the convolution operation node conv3. Therefore, a size of a storage space required to be occupied by the node conv3 should not be smaller than a size of a storage space occupied by output data of the node conv3. It may be contemplated that in this situation, the size of the storage space occupied by the node conv3 is not larger than (is smaller than or equal to) a size of a storage space occupied by a child node relu2 of the node conv3. Therefore, the node conv3 may be added to the second queue to be fused with the nodes add2, conv4, and relu2.
Next, a node add1 is traversed. The node add1 is an adding node. As shown in
It is required to be understood that a storage space occupied by each node may be evaluated according to characteristics of each node or may be determined according to a storage space occupied by each node in actual operation.
According to the above principles, the nodes add1, conv2, relu1, and conv1 may also be determined as nodes that may be fused, which will not be described herein.
In the above, the storage space occupied by each node is used as a standard to determine whether the node may be fused, which is a basic judgment standard. According to another implementation of the present disclosure, if a traversed candidate node is an input node of a to-be-fused node in a second queue and the traversed candidate node is an in place operation node, it is determined that the first storage space is smaller than or equal to the second storage space.
A simpler and more straightforward way may be used to judge whether a node may be fused with a node in the second queue. In this implementation, if the traversed node is an input node of a certain node in the second queue and the traversed node is an in place operation node, it may be directly judged that this traversed node satisfies a preset condition. This is because a storage space occupied by output data of an in place operation node is not larger than a storage space occupied by input data of the in place operation node, so a storage space occupied by the in place operation node is not larger than a storage space occupied by a child node of the in place operation node.
Whether the traversed node satisfies the preset condition may also be determined in the following way. According to an implementation of the present disclosure, if a traversed candidate node is an input node of a to-be-fused node in a second queue and the traversed candidate node is a non-in-place operation node, but all to-be-fused nodes in the second queue are in place operation nodes, it is determined that the first storage space is smaller than or equal to the second storage space.
In this implementation, when a node in an optimization search sequence is traversed, if this node is not an in place operation node, it may be first determined whether a child node of this node has been added to to-be-fused nodes in the second queue and whether all nodes in the second queue are in place operation nodes. If this condition is satisfied, it means that all nodes after the traversed node do not occupy more storage space, so that the traversed node may be fused with all in place operation nodes in the second queue.
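To summarize the storage-space judgments above in one place, a hedged sketch is given below; is_input_of, is_in_place, and space_of are assumed helper functions (they are not named by the present disclosure), and the comparison against the smallest child space is one conservative reading of the basic condition.

```python
def storage_space_condition(node, second_queue, is_input_of, is_in_place, space_of):
    """Preset condition based on occupied storage space.
    is_input_of(a, b): a is an input (parent) node of b.
    is_in_place(n):    n is an in place operation node.
    space_of(n):       storage space occupied by node n (estimated or measured)."""
    children_in_queue = [q for q in second_queue if is_input_of(node, q)]
    if not children_in_queue:
        return False
    # Shortcut 1: an in place input node never occupies more space than its child.
    if is_in_place(node):
        return True
    # Shortcut 2: a non-in-place input node may also be fused when all
    # to-be-fused nodes already in the second queue are in place operation nodes.
    if all(is_in_place(q) for q in second_queue):
        return True
    # Basic judgment: first storage space <= second storage space of a child node.
    return space_of(node) <= min(space_of(c) for c in children_in_queue)
```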
In the above implementations, whether a node may be fused is determined by judging a storage space occupied by the node. According to another implementation of the present disclosure, whether the node may be fused may also be judged based on running time of each operator (node).
According to an implementation of the present disclosure, if fusion running time after the traversed candidate node is fused with the to-be-fused node in the second queue is less than or equal to first running time, the traversed candidate node satisfies the preset condition, where the first running time is running time when the traversed candidate node is not fused with the to-be-fused node in the second queue.
The directed computing graph shown in
First, a node add2 is added to a second queue. Next, starting from a node conv4, nodes in an optimization search queue are traversed.
It may be contemplated that the node conv4 is a parent node of the node add2. After testing, it is obtained that running time of the nodes add2 and conv4 is 5 ms, while running time after the nodes add2 and conv4 are fused is 3 ms, and the time is reduced after fusion, so the two nodes may be fused. Then, this process continues to traverse a node relu2 backwards.
The node relu2 is a parent node of the node conv4, so the node relu2 may be fused with the nodes add2 and conv4. Then, this process continues to traverse a node conv3 backwards.
The node conv3 is a parent node of the node relu2. After testing, it is obtained that running time of the nodes add2, conv4, and relu2 after fusion and the node conv3 is 6 ms, while running time after the nodes add2, conv4, relu2, and conv3 are fused is 5 ms, and the time is reduced after fusion, so the nodes add2, conv4, relu2, and conv3 may be fused. Then, this process continues to traverse a node add1 backwards.
The node add1 is a parent node of the nodes add2 and conv3. After testing, it is obtained that running time of the nodes add2, conv4, relu2, and conv3 after fusion and the node add1 is 6 ms, while running time after the nodes add2, conv4, relu2, conv3, and add1 are fused is 7 ms, and the time is increased after fusion, so the node add1 may not be fused with the nodes add2, conv4, relu2, and conv3.
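A sketch of this running-time judgment is given below; measure_time is an assumed helper that returns the total running time when each group of nodes in its argument runs as one (fused) operator, and is not an interface defined by the present disclosure.

```python
def running_time_condition(node, second_queue, measure_time):
    """Fuse the candidate node with the to-be-fused nodes only if fusion does
    not increase the measured running time."""
    first_time = measure_time([second_queue, [node]])    # queue fused + node alone
    fusion_time = measure_time([second_queue + [node]])  # node fused into the queue
    return fusion_time <= first_time

# With the example numbers above: for conv4, 3 ms <= 5 ms, so conv4 is fused;
# for add1, 7 ms > 6 ms, so add1 is not fused.
```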
In the above, only independent running time of nodes and running time after nodes are fused are concerned. According to another implementation of the present disclosure, whether a node may be fused may be judged by considering both a storage space occupied by the node and running time of the node.
As shown in
Next, in the operation S620, whether the traversed candidate node is a non-in-place operation node is judged; if the traversed candidate node is the non-in-place operation node, an operation S630 is performed; if the traversed candidate node is not the non-in-place operation node, which means that the traversed candidate node is an in place operation node, this in place operation node is added to the second queue to be fused.
Next, in the operation S630, whether the to-be-fused nodes in the second queue are not all in place operation nodes is judged; if the to-be-fused nodes in the second queue are not all in place operation nodes, an operation S640 is performed; if the to-be-fused nodes in the second queue are all in place operation nodes, the traversed node is added to the second queue to be fused.
Next, in the operation S640, if fusion running time after the traversed candidate node is fused with the to-be-fused node in the second queue is less than or equal to first running time, the traversed candidate node satisfies the preset condition, where the first running time is running time when the traversed candidate node is not fused with the to-be-fused node in the second queue. In this situation, the traversed node is added to the second queue to be fused.
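The chain of judgments S610-S640 may be sketched as below; the helpers are the same assumed placeholders as in the earlier sketches, and the operation S610 is assumed here to be the check that the candidate is an input node of a to-be-fused node in the second queue, consistent with the walkthrough that follows.

```python
def combined_condition(node, second_queue, is_input_of, is_in_place, measure_time):
    """Combine the storage-space shortcuts with the running-time comparison."""
    # S610 (assumed): the candidate must be an input node of a to-be-fused node.
    if not any(is_input_of(node, q) for q in second_queue):
        return False
    # S620: an in place candidate is added to the second queue directly.
    if is_in_place(node):
        return True
    # S630: a non-in-place candidate is also added directly when all to-be-fused
    # nodes already in the second queue are in place operation nodes.
    if all(is_in_place(q) for q in second_queue):
        return True
    # S640: otherwise, fall back to the running-time comparison.
    return measure_time([second_queue + [node]]) <= measure_time([second_queue, [node]])
```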
Next, the directed computing graph shown in
First, a node add2 is added to a second queue. Next, starting from a node conv4, nodes in an optimization search queue are traversed.
It may be contemplated that the node conv4 is a parent node of the node add2 (a traversed candidate node is an input node of a to-be-fused node in the second queue), the nodes conv4 and add2 are not in place operation nodes (the traversed candidate node is a non-in-place operation node), and a node added to the second queue (the node add2) is not an in place operation node (which means that to-be-fused nodes in the second queue are not all in place operation nodes). After testing, it is obtained that running time of the nodes add2 and conv4 is 5 ms, while running completion time after the nodes add2 and conv4 are fused is 3 ms, and the time is reduced after fusion, so the two nodes may be fused. Then, this process continues to traverse a node relu2 backwards.
The node relu2 is a parent node of the node conv4, and the node relu2 is an in place operation node, so the node relu2 may be fused with the nodes add2 and conv4. Then, this process continues to traverse a node conv3 backwards.
The node conv3 is a parent node of the node relu2. The nodes conv3 and conv4 are not in place operation nodes. Moreover, not all nodes in the second queue are in place operation nodes. After testing, it is obtained that running time of the nodes add2, conv4, and relu2 after fusion and the node conv3 is 6 ms, while running completion time after the nodes add2, conv4, relu2, and conv3 are fused is 5 ms, and the time is reduced after fusion, so the nodes add2, conv4, relu2, and conv3 may be fused. Then, this process continues to traverse a node add1 backwards.
The node add1 is a parent node of the nodes add2 and conv3. The nodes add1 and conv3 are not in place operation nodes. Moreover, not all nodes in the second queue are in place operation nodes. After testing, it is obtained that running time of the nodes add2, conv4, relu2, and conv3 after fusion and the node add1 is 6 ms, while running completion time after the nodes add2, conv4, relu2, conv3, and add1 are fused is 7 ms, and the time is increased after fusion, so the node add1 may not be fused with the nodes add2, conv4, relu2, and conv3.
It is required to be understood that operation time of a node may be obtained by actual measurement or may be estimated by establishing an operation model of an operator. In this field, any known or unknown method may be used to obtain operation time of an operator (node).
It is also required to be understood that for the operations S610-S630, although the preceding sequence is shown above, such judgment operations do not necessarily follow the sequence shown in
As shown in
Further, as shown in
The above describes a method for judging nodes that may be fused and corresponding operations. For large tensor data, how to efficiently split the data to match fused nodes is also a problem worth paying attention to.
As shown in
In the above, “sub-tensor” refers to a sub-tensor formed after tensor data is split, such as a small picture formed after a picture is split. The sub-tensor dimension variables refer to various variables that represent dimensions of this sub-tensor. Through these variables, sub-tensor functions may be formed.
It is required to be understood that the above operations may exist independently, which means that the above operations need not rely on the fusion of nodes described above but may simply split the tensor data; the above operations may also rely on the fusion of nodes described above.
It may be understood that the tensor data may be represented in multiple dimensions. For example, a shape of a piece of tensor data is (720, 1080, 3), which represents that a height h of this piece of tensor data is 720, a width w of this piece of tensor data is 1080, and a channel count c of this piece of tensor data is 3. The channel count of the tensor data usually does not change depending on different operators, so the channel count c may be viewed as a constant. For the height h and width w of the tensor data, data for each operator may be different, so for each operator, the height h and width w may be viewed as variables.
Therefore, input data and output data of each operator may be represented by functions of h and w. A storage space occupied by each operator may also be represented by the variables h and w. For a plurality of operators that may be fused, the function of each operator may be different, but the maximum storage space occupied by each operator is a constant value; thus, specific values of the variables h and w are required to be determined according to a splitting scheme of the tensor data of each operator and a size of a preset storage space. After h and w are determined, the tensor data may be split. The preset storage space may be an on-chip storage space of hardware.
As shown in
For the operation S8110, only operators that are required to be fused are formed into the directed computing sub-graph, which is the same as the way of forming the directed computing graph introduced in combination with
Next, the sub-tensor function of the output tensor of the first to-be-fused sub-node is determined. Taking to-be-fused sub-operators (nodes) add2, conv4, relu2, and conv3 as examples, a sub-tensor function of an output tensor of the node add2 (an output node count of this node is 0) may be first determined, such as F(h, w, 32).
After the sub-tensor function of the first to-be-fused sub-node is determined, according to each operator, a sub-tensor function of input data of the operator may be reversely derived in a depth-first manner. For example, a sub-tensor function of input data of the node add2 may be derived as (F(h, w, 32), F(h, w, 32)); a sub-tensor function of input data of the node conv4 may be derived as F(h+2, w+2, 32); a sub-tensor function of input data of the node relu2 may be derived as F(h+2, w+2, 32); and a sub-tensor function of input data of the node conv3 may be derived as F(h+4, w+4, 32). Therefore, shapes of tensors of input and output data of all nodes may be expressions based on the variables h and w.
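For a simple chain of to-be-fused operators, this reverse derivation may be sketched as follows, assuming 3x3 convolutions so that an (h, w) output tile requires an (h+2, w+2) input tile, while add and relu keep the tile size unchanged; the chain below ignores the second input of add2 for brevity.

```python
def derive_tile_shapes(chain, h, w, channels=32, kernel=3):
    """Reverse (depth-first) derivation of sub-tensor shapes along a chain of
    to-be-fused operators, starting from the (h, w, channels) output tile."""
    halo = kernel - 1
    shapes = {}
    for name, op_type in chain:            # traversed from the output backwards
        shapes[name] = (h, w, channels)    # output tile of this operator
        if op_type == "conv":
            h, w = h + halo, w + halo      # a conv needs a larger input tile
        # "add" and "relu" leave the tile size unchanged
    shapes["fused_input"] = (h, w, channels)  # tile sliced from the overall input
    return shapes

# add2 output F(h, w, 32); conv4/relu2 inputs F(h+2, w+2, 32); conv3 input F(h+4, w+4, 32):
print(derive_tile_shapes(
    [("add2", "add"), ("conv4", "conv"), ("relu2", "relu"), ("conv3", "conv")],
    h=4, w=4))
```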
It is required to be understood that in the context, a symbol “F” is used to represent various kinds of sub-tensor functions. However, the function symbol F is merely a general term for functions, and does not mean that every function must be the same.
In a specific implementation, reversely deriving the sub-tensor function of each of the other to-be-fused sub-nodes according to the directed computing sub-graph may include:
when the to-be-fused sub-node has a plurality of outputs, reversely deriving a maximum sub-tensor function of each of the other to-be-fused sub-nodes.
A node may have a plurality of outputs. In this situation, rather than simply reversely deriving the input according to a certain output in a depth-first manner, the maximum sub-tensor function of the to-be-fused node should be reversely derived according to the plurality of outputs, and then, maximum tensor functions of all other to-be-fused nodes are derived. Determining the maximum tensor functions will help to obtain a maximum storage space occupied by each node.
As shown in
After a sub-tensor function of each to-be-fused sub-node is determined, a storage space occupied by data represented by the sub-tensor function may be determined. For example, a sub-tensor function of output data of a node add2 is F(h, w, 32), so a storage space occupied by the output data may be represented as h*w*32. It is required to be understood that the expression for computing a storage space here is merely an illustrative representation, which is not necessarily the same as the way the occupied storage space is actually computed. For example, this expression may omit some constants: if a size of each pixel is 2 bytes, the storage space occupied by the output data of the node add2 may actually be expressed as h*w*32*2B, and the constant 2B is omitted here for the sake of description.
After each sub-tensor function is acquired, an optimal value for each variable in the sub-tensor function may be determined according to the sub-tensor function and the size of the preset storage space. Given the use of storage space by some temporary scalars and function call stacks, it is possible to start traversing w from a certain value; here, the traversal starts with 1. For each w, a maximum h that satisfies h*w*32<=preset storage space may be found, until there is no h that satisfies the constraint and the traversal stops. In practice, an (h, w) with the minimal actually tested running time may be selected.
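A sketch of this traversal is given below; the channel count 32 and the element size of 2 bytes follow the example above, while the upper bounds on h and w are assumptions introduced only to keep the illustration finite.

```python
def search_tile_sizes(preset_space, channels=32, bytes_per_element=2,
                      max_h=720, max_w=1080):
    """For each w starting from 1, find the largest h such that an
    (h, w, channels) tile fits in the preset on-chip storage space."""
    candidates = []
    for w in range(1, max_w + 1):
        h = preset_space // (w * channels * bytes_per_element)
        if h < 1:
            break                          # no h satisfies the constraint: stop
        candidates.append((min(h, max_h), w))
    return candidates

# Among the candidates, the (h, w) with the minimal actually tested running
# time would then be selected, e.g.:
candidates = search_tile_sizes(preset_space=512 * 1024)   # a 512 KiB on-chip space
print(candidates[:3])
```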
It is required to be understood that the sub-tensor functions above include not only a sub-tensor function of output data of a certain node, but also a sub-tensor function of input data of that node and a sub-tensor function of intermediate data generated during operation. In order to facilitate understanding, the present disclosure mainly describes the sub-tensor function of the output data and the sub-tensor function of the input data by example.
According to an implementation of the present disclosure, determining the corresponding maximum sub-tensor dimension variable when the storage space occupied by the sub-tensor data is less than or equal to the preset storage space may include: determining a maximum storage space occupied by the output sub-tensor data, the intermediate sub-tensor data, and the input sub-tensor data; and when the maximum storage space is not greater than the preset storage space, determining a tensor dimension variable of sub-tensor data corresponding to the maximum storage space as the maximum sub-tensor dimension variable.
Further, in order to compute a maximum storage space occupied by a certain node, a maximum storage space occupied by all input data, output data, and intermediate data of the node is required to be computed.
Here, assuming that a maximum storage space required to be occupied by a certain node is represented as m, a storage space occupied by output data of a node add2 is h*w*32; the node add2 is connected to two input nodes, namely a node conv4 and a node add1, and maximum storage spaces occupied by these two nodes are represented as mconv4 and madd1 respectively, so a maximum storage space occupied by the node add2 is madd2 = max{mconv4 + madd1, h*w*32}.
It is assumed that an input and output channel count for all convolution operators is 32, a size of a convolution kernel is (3, 3), and the height and width are filled as (1, 1). According to a mathematical definition of convolution, a sub-tensor function of a piece of input data of the node add2 is F(h+2, w+2, 32). Accordingly, a sub-tensor function of output data of the node conv4 is F(h+2, w+2, 32), so a storage space occupied by the output data of the node conv4 is (h+2)*(w+2)*32; a storage space occupied by input data of the node conv4 is represented as mrelu2, so a maximum storage space occupied by the node conv4 is mconv4 = max{mrelu2, (h+2)*(w+2)*32}.
A sub-tensor function of input data of the node conv4 is F(h+2, w+2, 32). Accordingly, a sub-tensor function of output data of the node relu2 is F(h+2, w+2, 32), so a storage space occupied by the output data of the node relu2 is (h+2)*(w+2)*32; a storage space occupied by input data of the node relu2 is represented as mconv3, so a maximum storage space occupied by the node relu2 is mrelu2 = max{mconv3, (h+2)*(w+2)*32}.
A sub-tensor function of output data of the node conv3 is F(h+4, w+4, 32). Therefore, a storage space occupied by the output data of the node conv3 is (h+4)*(w+4)*32; a storage space occupied by input data of the node conv3 is represented as madd1, so a maximum storage space occupied by the node conv3 is mconv3 = max{madd1, (h+4)*(w+4)*32}.
In this way, maximum storage spaces occupied by all nodes may be obtained.
For to-be-fused operators (nodes), it is expected that the storage spaces occupied by all nodes are not greater than a preset storage space. A maximum sub-tensor function may be selected such that the storage space corresponding to the sub-tensor function is not greater than the preset storage space. Thus, with the method described above, optimal values of h and w may be found by traversal.
According to an implementation of the present disclosure, when a to-be-fused sub-node has a plurality of inputs, a maximum storage space occupied by a sum of the output sub-tensor data, the intermediate sub-tensor data, and a plurality of pieces of input sub-tensor data is determined.
As shown in the above, input data of the node add2 includes output data from the node conv4 and output data from the node add1, so in the technical solution of the present disclosure, a total space occupied by a sum of a plurality of pieces of input sub-tensor data as input data is required to be computed, such as a formula madd2=max {mconv4+madd1, h*w*32} shown above.
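Evaluating these expressions for the fused group add2, conv4, relu2, and conv3 may be sketched as follows; madd1, the maximum storage space of the upstream node add1, is passed in as a given value because add1 is outside this fused group in the example, and the formulas simply transcribe those listed above.

```python
def max_storage_spaces(h, w, m_add1, channels=32):
    """Recursive maximum-storage expressions for the example fused group."""
    m_conv3 = max(m_add1, (h + 4) * (w + 4) * channels)
    m_relu2 = max(m_conv3, (h + 2) * (w + 2) * channels)
    m_conv4 = max(m_relu2, (h + 2) * (w + 2) * channels)
    m_add2 = max(m_conv4 + m_add1, h * w * channels)  # two inputs, so spaces are summed
    return {"conv3": m_conv3, "relu2": m_relu2, "conv4": m_conv4, "add2": m_add2}

def fits(h, w, m_add1, preset_space):
    """All to-be-fused nodes must fit in the preset on-chip storage space."""
    return max(max_storage_spaces(h, w, m_add1).values()) <= preset_space
```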
As shown in
For the operation S1110, referring to
After fusion, the split data may be actually run, so that actual running time of the split data may be obtained, which makes it convenient to evaluate the splitting effect of the data or the fusion effect of the neural network operators. In another implementation, instead of actually running the neural network, the running time of the split data in the fusion operator may be estimated through a neural network model.
In another implementation of the present disclosure, after the splitting scheme of the tensor data is determined, the split tensor data may be run in a neural network where operators are fused and in a neural network where operators are not fused, so as to compare the performance difference between fused and unfused operators.
An optimization search sequence Q9={add2, conv4, relu2, conv3, add1, conv2, relu1, conv1, conv_first} is still taken as an example for explanation. For example, the nodes add2 and conv4 may be fused, the nodes add2, conv4, and relu2 may be fused, the nodes add2, conv4, relu2, and conv3 may be fused, and the nodes add2, conv4, relu2, conv3, and add1 may be fused, so as to evaluate running time of specific sub-tensors in each fusion scheme, thus obtaining an optimal fusion scheme. Further, the running time of specific sub-tensors in each fusion scheme may be used to determine running time after a candidate node is fused with operators in the second queue, so as to determine whether the candidate node satisfies a preset condition of operator fusion.
Further, according to an implementation of the present disclosure, generating the fusion operator includes generating the code of the fusion operator.
Before fusion, code may be generated for each operator individually. For example, before fusion, the four operators add2, conv4, relu2, and conv3 may each be executed by a separate call.
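The original listing is not reproduced here; as a hedged illustration only, such per-operator execution might look like the following sketch, in which conv, relu, add, and slice_padding are simplified NumPy stand-ins whose signatures are assumptions of this illustration rather than the functions of the disclosure.

```python
import numpy as np

def slice_padding(x, pad=1):
    """Pad the border of a tensor tile (zero padding; slicing is omitted in this sketch)."""
    return np.pad(x, ((pad, pad), (pad, pad), (0, 0)))

def conv(x, weight):
    """Naive 3x3 'valid' convolution over an (H, W, C_in) tensor."""
    kh, kw, c_in, c_out = weight.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.tensordot(x[i:i + kh, j:j + kw], weight, axes=3)
    return out

def relu(x):
    return np.maximum(x, 0)

def add(a, b):
    return a + b

# Before fusion: every operator reads and writes a full-sized tensor, so the
# intermediate results t1..t3 live in off-chip storage.
x = np.random.rand(32, 32, 32)         # output of add1 (feeds both conv3 and add2)
w3 = np.random.rand(3, 3, 32, 32)      # conv3 weights
w4 = np.random.rand(3, 3, 32, 32)      # conv4 weights
t1 = conv(slice_padding(x), w3)        # conv3
t2 = relu(t1)                          # relu2
t3 = conv(slice_padding(t2), w4)       # conv4
y = add(t3, x)                         # add2
```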
The omission of some parameters in the conv, relu, add2, and slice_padding functions does not affect the description of this implementation.
After fusion, the code of the fusion operator may be generated, so that a new fusion operator block is obtained and executed as a whole.
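Again as an illustration only (reusing the conv, relu, add, and slice_padding stand-ins and the tensors x, w3, and w4 from the previous sketch, and simplifying border handling relative to per-operator padding), the fused block might process the tensor tile by tile so that every intermediate result stays within the on-chip storage:

```python
import numpy as np

def fused_block(x, w3, w4, tile_h, tile_w):
    """Fusion operator sketch for conv3 -> relu2 -> conv4 -> add2: an (h, w)
    output tile needs an (h+4, w+4) input tile because two 3x3 convolutions
    are fused, and all intermediate tiles stay small enough for on-chip storage."""
    H, W, _ = x.shape
    x_halo = slice_padding(x, pad=2)                  # border for the fused halo
    y = np.empty_like(x)
    for i in range(0, H, tile_h):
        for j in range(0, W, tile_w):
            h, w = min(tile_h, H - i), min(tile_w, W - j)
            tile = x_halo[i:i + h + 4, j:j + w + 4]   # F(h+4, w+4, 32) input tile
            t = conv(tile, w3)                        # conv3 -> F(h+2, w+2, 32)
            t = relu(t)                               # relu2 (in place)
            t = conv(t, w4)                           # conv4 -> F(h, w, 32)
            y[i:i + h, j:j + w] = add(t, x[i:i + h, j:j + w])  # add2
    return y

# The tile size (tile_h, tile_w) would be the (h, w) chosen by the search above:
y_fused = fused_block(x, w3, w4, tile_h=16, tile_w=16)
```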
The above code is only an illustrative description of the present disclosure, and each individual operator and fusion operator block will be expressed differently depending on different operators and neural network structures.
The present disclosure also provides an electronic device, including: one or a plurality of processors; and a memory, on which a computer-executable instruction is stored, where when the computer-executable instruction is run by the one or the plurality of processors, the electronic device performs the method described above.
The present disclosure also provides a computer-readable storage medium, including a computer-executable instruction, where when the computer-executable instruction is run by one or a plurality of processors, the method described above is performed.
The disclosed technical solution is tested in the following test environment: based on an artificial intelligence chip, on a computer with a Linux operating system, and using an SRResNet supersampling network. Test results show that by using the method of the present disclosure, the running time of the whole network is 31 ms, which is ½ of the running time when the method is not used. However, by using other existing methods, the running time is greater than 31 ms. Obviously, the solution of the present disclosure significantly improves the running efficiency of the neural network.
The solution of the present disclosure provides a data locality optimization method for a deep learning supersampling network. Through operator fusion and data splitting, time of running an image supersampling network on a multilevel on-chip storage processor, especially a graphics processing unit (GPU) or deep learning-dedicated processor, is reduced. The method of the present disclosure may be applied to the GPU and other deep learning-dedicated processors with a multilevel high-speed on-chip memory.
The technical solution of the present disclosure may be applied to the field of artificial intelligence and may be implemented as or may be implemented in an artificial intelligence chip. The chip may stand alone or may be included in a computing processing apparatus.
Other processing apparatus includes one or more types of general-purpose/special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, and the like. A count of processors included in other processing apparatus is not limited. Other processing apparatus serves as an interface between a machine learning operation apparatus and external data and control, and completes basic controls, such as moving data, starting and stopping the machine learning operation apparatus; and other processing apparatus may also cooperate with the machine learning operation apparatus to complete an operation task.
The interface apparatus is configured to transfer data and a control instruction between the computing processing apparatus (such as the machine learning operation apparatus) and other processing apparatus. The computing processing apparatus acquires required input data from other processing apparatus and writes the data into an on-chip storage apparatus of the computing processing apparatus. The computing processing apparatus may also acquire the control instruction from other processing apparatus and write the control instruction into an on-chip control cache of the computing processing apparatus. The computing processing apparatus may further read data stored in a storage unit of the computing processing apparatus and transfer the data to other processing apparatus.
Optionally, the structure may further include a storage apparatus 1208. The storage apparatus is connected to the computing processing apparatus and other processing apparatus, respectively. The storage apparatus is configured to store data of the computing processing apparatus and other processing apparatus. The storage apparatus is especially suitable for storing data that may not be completely stored in the internal storage of the computing processing apparatus or other processing apparatus of the present disclosure.
The combined processing apparatus may be used as a system on chip (SOC) of a device including a mobile phone, a robot, a drone, a video surveillance device, and the like, which may effectively reduce a core area of a control component, increase processing speed, and reduce overall power consumption. In this situation, the interface apparatus of the combined processing apparatus is connected to some components of the device. The components include, for example, a webcam, a monitor, a mouse, a keyboard, a network card, and a WIFI interface.
In some embodiments, the present disclosure also provides a chip package structure, including the above chip.
In some embodiments, the present disclosure provides a board card, including the above chip package structure.
The storage component is connected to the chip in the chip package structure through a bus and is used to store data. The storage component may include a plurality of groups of storage units 1310. Each group of storage units is connected to the chip through the bus. It may be understood that each group of the storage units may be a double data rate (DDR) synchronous dynamic random access memory (SDRAM).
The DDR doubles the speed of the SDRAM without increasing clock frequency. The DDR allows data to be read on rising and falling edges of a clock pulse. The speed of the DDR is twice that of a standard SDRAM. In an embodiment, the storage component may include four groups of storage units. Each group of storage units may include a plurality of DDR4 particles (chips). In an embodiment, four 72-bit DDR4 controllers may be arranged inside the chip, where 64 bits of the 72-bit DDR4 controller described above are used for data transfer, and 8 bits are used for error checking and correcting (ECC) parity. In an embodiment, each group of storage units includes a plurality of DDR SDRAMs arranged in parallel. The DDR may transfer data twice per clock cycle. A controller for controlling the DDR may be arranged in the chip, and the controller is used to control data transfer and data storage of each storage unit.
The external interface apparatus is electrically connected to the chip in the chip package structure. The external interface apparatus is used to implement data transfer between the chip and an external device 1312 (such as a server or a computer). For example, in an embodiment, the external interface apparatus may be a standard peripheral component interconnect express (PCIe) interface. For instance, to-be-processed data is transferred by the server through the standard PCIe interface to the chip, so as to implement data transfer. In another embodiment, the external interface apparatus may also be other interfaces. The present disclosure does not limit specific forms of other interfaces mentioned above, as long as an interface unit may realize a switching function. Additionally, a computing result of the chip is still sent back to the external device (such as the server) through the external interface apparatus.
The control component is electrically connected to the chip. The control component is used to monitor a state of the chip. Specifically, the chip and the control component may be electrically connected through a serial peripheral interface (SPI). The control component may include a micro controller unit (MCU). If the chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, the chip may be capable of driving a plurality of loads. Therefore, the chip may be in different working states, such as a multi-load state and a light-load state. Through the control component, regulation and control of working states of the plurality of processing chips, processing cores and/or processing circuits in the chip may be realized.
In some embodiments, the present disclosure further discloses an electronic device or apparatus, including the above board card.
The electronic device or apparatus may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a mobile phone, a traffic recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, a headphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an airplane, a ship, and/or a car. The household appliance may include a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas cooker, and a range hood. The medical device may include a nuclear magnetic resonance spectrometer, a B-ultrasonic scanner, and/or an electrocardiograph.
It is required to be explained that for the sake of conciseness, the foregoing method embodiments are all described as a series of combinations of actions, but those skilled in the art should know that the present disclosure is not limited by the described order of action since some steps may be performed in a different order or simultaneously according to the present disclosure. Moreover, those skilled in the art should also understand that embodiments described in the specification are all optional, and actions and units involved are not necessarily required for the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in a certain embodiment, reference may be made to related descriptions in other embodiments.
In several embodiments provided in the present disclosure, it should be understood that the apparatus disclosed may be implemented in other ways. For instance, the apparatus embodiments above are merely illustrative. For instance, a division of units is only a logical function division. In an actual implementation, there may be other division methods. For instance, a plurality of units or components may be combined or may be integrated in another system, or some features may be ignored or may not be performed. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be implemented through indirect coupling or communication connection of some interfaces, apparatuses, or units, and may be in electrical, optical, acoustic, magnetic, or other forms.
Units described as separate components may or may not be physically separated. Components shown as units may or may not be physical units. In other words, the components may be located in one place or distributed to a plurality of network units. According to actual requirements, some or all of the units may be selected for achieving purposes of the embodiments of the present disclosure.
Additionally, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist separately and physically, or two or more units may be integrated in one unit. The integrated unit described above may be implemented either in the form of hardware or in the form of a software program unit. If the integrated unit is implemented in the form of the software program unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. Based on such understanding, when the technical solution of the present disclosure is embodied in the form of a software product, the software product may be stored in a memory. The software product includes several instructions used to enable a computer device (which may be a personal computer, a server, or a network device, and the like) to perform all or part of steps of the method of the embodiments of the present disclosure. The foregoing memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, or an optical disc, and other media that may store program codes.
The embodiments of the present disclosure have been described in detail above. The present disclosure explains principles and implementations of the present disclosure with specific examples. Descriptions of the embodiments above are only used to facilitate understanding of the method and core ideas of the present disclosure. Simultaneously, those skilled in the art may change the specific implementations and application scope of the present disclosure based on the ideas of the present disclosure. In summary, the content of this specification should not be construed as a limitation on the present disclosure.
This application claims benefit under 35 U.S.C. 119, 120, 121, or 365(c), and is a National Stage entry from International Application No. PCT/CN2022/095109 filed on May 26, 2022, which claims priority to the benefit of Chinese Patent Application No. 202110580167.X filed in the Chinese Intellectual Property Office on May 26, 2021, the entire contents of which are incorporated herein by reference.