This disclosure relates to the field of computer technologies, and in particular, to a neural network system and a data processing technology.
Deep learning (DL) is an important branch of artificial intelligence (AI). DL is used for constructing a neural network (NN) through simulating a human brain, to achieve a better recognition effect than a conventional shallow learning manner. A convolutional neural network (CNN) is one of the most common deep learning architectures, and is also a widely studied DL method. A typical processing field of the CNN is image processing. Image processing is an application in which an input image is recognized and analyzed and a group of classified image content is finally output. For example, a CNN algorithm may be used to extract a body color, a license plate number, and a model of a vehicle from an image, and output them after classification.
In the CNN, a feature of an image is usually extracted by using a three-layer sequence: a convolutional layer, a pooling layer, and a rectified linear unit (ReLU). A process of extracting a feature of an image is actually a process including a series of matrix operations (for example, a matrix multiply-add operation). Therefore, how to quickly process images in the network in parallel becomes a problem that needs to be studied for the convolutional neural network.
This disclosure provides a neural network system and a data processing technology, to improve a data processing rate in a neural network.
According to a first aspect, an embodiment provides a neural network system. The neural network system includes P computing units configured to perform an operation of a first neural network layer and Q computing units configured to perform an operation of a second neural network layer. The P computing units are configured to: receive first input data, and perform computing on the first input data based on N configured first weights, to obtain first output data. The Q computing units are configured to: receive second input data, and perform computing on the second input data based on M configured second weights, to obtain second output data. The second input data includes the first output data. Herein, P, Q, N, and M are all positive integers, and a ratio of N to M corresponds to a ratio of a data volume of the first output data to a data volume of the second output data.
In the neural network system provided in this embodiment, the N weights are configured for the P computing units for performing the operation of the first neural network layer, and the M weights are configured for the Q computing units for performing the operation of the second neural network layer. The ratio of a value of N to a value of M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data, so that computing capabilities of the P computing units match computing capabilities of the Q computing units. Therefore, a computing capability of a computing node for performing an operation of each neural network layer can be fully used, thereby improving data processing efficiency.
With reference to the first aspect, in a first possible implementation, the neural network system includes a plurality of neural network chips. Each neural network chip includes a plurality of level 2 computing nodes. Each level 2 computing node includes a plurality of computing units. Each computing unit includes at least one resistive random-access memory (ReRAM) crossbar.
With reference to the first aspect or the first possible implementation, in a second possible implementation, values of N and M are determined based on a deployment requirement of the neural network system, the data volume of the first output data, and the data volume of the second output data.
With reference to the second possible implementation, in a third possible implementation, the deployment requirement includes a computing delay. The first neural network layer is a beginning layer of all neural network layers in the neural network system. The value of N is determined based on the data volume of the first output data, the computing delay, and a computing frequency of a ReRAM crossbar; and the value of M is determined based on the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
In a possible implementation, when the first neural network layer is the beginning layer of all the neural network layers in the neural network system, the value of N may be obtained according to the following formula:
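Expressed with the variables defined in the next paragraph, and assuming that each weight copy produces one output element per computing cycle, one consistent form of this formula is:
nreplicas1=(FMout1·x*FMout1·y)/(t*f), rounded up to an integer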
Herein, nreplicas1 is used to indicate a quantity N of weights that need to be configured for the first neural network layer, FMout1·x is a row quantity of output data of the first neural network layer, FMout1·y is a column quantity of the output data of the first neural network layer, t is a specified computing delay, and f is a computing frequency of a crossbar in a computing unit. The value of M may be calculated according to the following formula: N/M=the data volume of the first output data/the data volume of the second output data.
With reference to the second possible implementation, in a fourth possible implementation, the deployment requirement includes a quantity of neural network chips. The first neural network layer is a beginning layer of all neural network layers in the neural network system. The value of N is determined based on the quantity of chips, a quantity of ReRAM crossbars in each chip, a quantity of ReRAM crossbars that are required for deploying a weight of each neural network layer, and a data volume ratio of output data of adjacent neural network layers; and the value of M is determined based on the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
Specifically, in a possible implementation, when the deployment requirement is the quantity of chips required in the neural network system, and the first neural network layer is the beginning layer in the neural network system, the N first weights that need to be configured for the first neural network layer and the M second weights that need to be configured for the second neural network layer may be obtained according to the following two formulas, where the value of N is a value of nreplicas1, and the value of M is a value of nreplicas2.
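Read together with the variable definitions in the following paragraph, one consistent reading of the two formulas is a crossbar-budget constraint and an adjacent-layer ratio:
xb1*nreplicas1+xb2*nreplicas2+ . . . +xbn*nreplicasn≤K*L
nreplicasi/nreplicasi-1=(FMouti·x*FMouti·y)/(FMouti-1·x*FMouti-1·y)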
Herein, xb1 is used to indicate a quantity of crossbars that are required for deploying a weight of a first (or referred to as a beginning) neural network layer, nreplicas1 is used to indicate a quantity of weights required for the beginning layer, xb2 is used to indicate a quantity of crossbars that are required for deploying a weight of the second neural network layer, nreplicas2 is used to indicate a quantity of weights required for the second neural network layer, xbn is used to indicate a quantity of crossbars that are required for deploying a weight of an nth neural network layer, nreplicasn is used to indicate a quantity of weights required for the nth neural network layer, K is a quantity of chips that are required in the deployment requirement for the neural network system, L is a quantity of crossbars in each chip, nreplicasi is used to indicate a quantity of weights required for an ith layer, nreplicasi-1 is used to indicate a quantity of weights required for an (i−1)th layer, FMouti·x is used to indicate a row quantity of output data of the ith layer, FMouti·y is used to indicate a column quantity of the output data of the ith layer, FMouti-1·x is used to indicate a row quantity of output data of the (i−1)th layer, FMouti-1·y is used to indicate a column quantity of the output data of the (i−1)th layer, a value of i may be from 2 to n, and n is a total layer quantity of the neural network layers in the neural network system.
With reference to any one of the possible implementations of the first aspect, in a fifth possible implementation, at least some of the P computing units and at least some of the Q computing units are located in a same level 2 computing node.
With reference to any one of the possible implementations of the first aspect, in a sixth possible implementation, at least some of level 2 computing nodes to which the P computing units belong and at least some of level 2 computing nodes to which the Q computing units belong are located in a same neural network chip.
With reference to the first aspect or any one of the possible implementations of the first aspect, in another possible implementation, that the ratio of N to M corresponds to the ratio of the data volume of the first output data to the data volume of the second output data includes: The ratio of N to M is the same as the ratio of the data volume of the first output data to the data volume of the second output data.
According to a second aspect, in a data processing method applied to a neural network system, P computing units in the neural network system receive first input data, and perform computing on the first input data based on N configured first weights, to obtain first output data. The P computing units are configured to perform an operation of a first neural network layer. In addition, Q computing units in the neural network system receive second input data, and perform computing on the second input data based on M configured second weights, to obtain second output data. The Q computing units are configured to perform an operation of a second neural network layer. The second input data includes the first output data. Herein, P, Q, N, and M are all positive integers, and a ratio of N to M corresponds to a ratio of a data volume of the first output data to a data volume of the second output data.
With reference to the second aspect, in a first possible implementation, the first neural network layer is a beginning layer of all neural network layers in the neural network system. A value of N is determined based on the data volume of the first output data, a specified computing delay in the neural network system, and a computing frequency of a ReRAM crossbar in a computing unit; and a value of M is determined based on the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
With reference to the second aspect, in a second possible implementation, the neural network system includes a plurality of neural network chips. Each neural network chip includes a plurality of computing units. Each computing unit includes at least one ReRAM crossbar. The first neural network layer is a beginning layer in the neural network system. A value of N is determined based on a quantity of the plurality of neural network chips, a quantity of ReRAM crossbars in each chip, a quantity of ReRAM crossbars that are required for deploying a weight of each neural network layer, and a volume ratio of output data of adjacent neural network layers; and a value of M is determined based on the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
With reference to the second aspect or either one of the possible implementations of the second aspect, in a third possible implementation, the neural network system includes the plurality of neural network chips. Each neural network chip includes a plurality of level 2 computing nodes. Each level 2 computing node includes the plurality of computing units. At least some of the P computing units and at least some of the Q computing units are located in a same level 2 computing node.
With reference to the third possible implementation of the second aspect, in another possible implementation, at least some of level 2 computing nodes to which the P computing units belong and at least some of level 2 computing nodes to which the Q computing units belong are located in a same neural network chip.
With reference to the second aspect or any one of the possible implementations of the second aspect, in another possible implementation, the ratio of N to M is the same as the ratio of the data volume of the first output data to the data volume of the second output data.
According to a third aspect, a computer program product is provided. The computer program product includes program code, and an instruction included in the program code is executed by a computer to implement the data processing method according to the second aspect or any one of the possible implementations of the second aspect.
According to a fourth aspect, a computer-readable storage medium is configured to store program code. An instruction included in the program code is executed by a computer to implement the method according to the second aspect or any one of the possible implementations of the second aspect.
To describe technical solutions in embodiments more clearly, the following briefly describes the accompanying drawings for describing the embodiments. The accompanying drawings in the following description show merely some embodiments.
To make a person skilled in the art understand the technical solutions better, the following describes the technical solutions with reference to the accompanying drawings. The described embodiments are merely some, but not all, of the embodiments.
DL is an important branch of AI. Deep learning is used for constructing a neural network through simulating a human brain, to achieve a better recognition effect than a conventional shallow learning manner. An artificial neural network (ANN) is also referred to as an NN or a neural-like network. In the field of machine learning and cognitive science, the artificial neural network is a mathematical model or a computing model for simulating a structure and a function of a biological neural network (the central nervous system of an animal, especially the brain), to estimate or approximate a function. The artificial neural network may include a neural network such as a CNN, a deep neural network (DNN), or a multilayer perceptron (MLP).
The host 105 may include a processor 1052 and a memory 1054. It should be noted that, in addition to the components shown in
The processor 1052 is a computing core and a control core of the host 105. The processor 1052 may include a plurality of processor cores. The processor 1052 may be an ultra-large-scale integrated circuit. An operating system and another software program are installed in the processor 1052, so that the processor 1052 can access the memory 1054, a cache, a magnetic disk, and a peripheral device (for example, the NN circuit in
The memory 1054 is a main memory of the host 105. The memory 1054 is connected to the processor 1052 by using a double data rate (DDR) bus. The memory 1054 is usually configured to store various software running in the operating system, input data and output data, information exchanged with an external memory, and the like. To improve an access rate of the processor 1052, the memory 1054 needs to have an advantage of a high access rate. In a conventional computer system architecture, a dynamic random-access memory (DRAM) is usually used as the memory 1054. The processor 1052 can access the memory 1054 at a high rate by using a memory controller (not shown in
The CNN circuit 110 is a chip array including a plurality of NN chips. For example, as shown in
Further,
It should be noted that when the chips 115 are connected to each other by using the routers, the one or more network topologies including the plurality of routers 120 in the convolutional neural network circuit 110 may be the same as or different from the network topologies including the plurality of routers 122 in the data processing chip 115, provided that data transmission can be performed between the chips 115 or between the tiles 125 by using the network topologies and the chips 115 or the tiles 125 can receive data or output data by using the network topologies. In this embodiment, a quantity of networks including the plurality of routers 120 and 122 and types of the networks are not limited. In addition, in this embodiment, the router 120 may be the same as or different from the router 122. For a clear description, the router 120 connected to the chip and the router 122 connected to the tile are distinguished from each other by using different identifiers in
In actual application, in another case, the chips 115 may be connected to each other by using a high-transport interface instead of the routers 120.
A person skilled in the art may know that, because a novel non-volatile memory such as a ReRAM has an advantage of integrating storage and computing, the non-volatile memory has also been widely applied to neural network systems in recent years. For example, a ReRAM crossbar including a plurality of memristor units (also known as ReRAM cells) may be configured to perform a matrix multiply-add operation in the neural network system. In this embodiment, the engine 1258 may include one or more crossbars. A structure of the ReRAM crossbar may be shown in
In addition, a person skilled in the art may know that the neural network system may include a plurality of neural network layers. In this embodiment, the neural network layer is a concept of a logical layer. One neural network layer indicates that a neural network operation needs to be performed once. Computing of each neural network layer is implemented by a computing node. The neural network layer may include a convolutional layer, a pooling layer, and the like. As shown in
In an existing neural network system, after computing of an ith layer in the neural network is completed, a computing result of the ith layer is temporarily stored in a preset cache. When computing of an (i+1)th layer is performed, a computing unit needs to reload the computing result of the ith layer and a weight of the (i+1)th layer from the preset cache to perform computing. The ith layer is any layer in the neural network system. In this embodiment, a ReRAM crossbar is applied in the engine of the neural network system, and the ReRAM has an advantage of integrating storage and computing. In addition, a weight may be configured for a ReRAM cell before computing, and a computing result may be directly sent to a next layer for pipeline computing. Therefore, each neural network layer only needs to cache very little data. For example, each neural network layer needs to cache only enough input data for one window computation. Further, to implement parallel processing and fast processing on data, an embodiment provides a manner of performing stream processing on data by using a neural network. For a clear description, the following briefly describes stream processing in the neural network system with reference to the convolutional neural network system shown in
As shown in
In addition, a person skilled in the art may know that when performing a neural network operation (for example, convolution computing), a computing node (for example, the tile 125) may perform, based on a weight of a corresponding neural network layer, computing on data input to the computing node. For example, a specific tile 125 may perform, based on a weight of a corresponding convolutional layer, a convolution operation on input data input to the tile 125, for example, perform matrix multiply-add computing on the weight and the input data. The weight is usually used to indicate importance of input data to output data. In a neural network, the weight is usually presented by using a matrix. As shown in
It may be learned from the foregoing description that, in this embodiment, in a process of implementing stream processing in the neural network, the computing nodes in the neural network may be divided into node sets used for processing of different neural network layers, and be configured with corresponding weights. Therefore, the computing nodes in different node sets can perform corresponding computing based on the configured weights. In addition, a computing node in each node set can send a computing result to a computing node configured to perform an operation of a next neural network layer. A person skilled in the art may know that, in the process of implementing the stream processing in the neural network, if computing resources for performing operations of different neural network layers do not match, for example, if few computing resources are used for performing an operation of an upper neural network layer but relatively more computing resources are used for performing an operation of a current neural network layer, computing resources of the current-layer computing node are wasted. To fully use a computing capability of a computing node and make computing capabilities of computing nodes for performing operations of different neural network layers match, an embodiment provides a computing resource allocation method, to allocate computing nodes for performing the operations of different neural network layers, so that computing capabilities of computing nodes for performing operations of two adjacent neural network layers match, thereby improving data processing efficiency in the neural network system and avoiding waste of the computing resources.
In step 502, network model information of a neural network system is obtained. The network model information includes a first output data volume of a first neural network layer and a second output data volume of a second neural network layer in the neural network system. The network model information may be determined based on an actual application requirement. For example, a total layer quantity of neural network layers and an algorithm of each layer may be determined according to an application scenario of the neural network system. The network model information may include the total layer quantity of the neural network layers in the neural network system, the algorithm of each layer, and a data output volume of each neural network layer. In this embodiment, the algorithm means a neural network operation that needs to be performed. For example, the algorithm may mean a convolution operation, a pooling operation, and the like. As shown in
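As one illustration only, the network model information described in step 502 could be represented by a structure such as the following sketch; the class names, field names, and method are hypothetical and not part of this embodiment.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LayerInfo:
    """Hypothetical per-layer entry of the network model information."""
    operation: str      # algorithm of the layer, for example "conv" or "pool"
    out_rows: int       # row quantity of the layer's output data (FMout.x)
    out_cols: int       # column quantity of the layer's output data (FMout.y)
    out_channels: int   # channel quantity (kernel count) of the layer

@dataclass
class NetworkModelInfo:
    """Hypothetical container: the total layer quantity is len(layers)."""
    layers: List[LayerInfo]   # ordered list, beginning layer first

    def output_volume(self, i: int) -> int:
        # Output data volume of the i-th layer (1-based), including channels.
        layer = self.layers[i - 1]
        return layer.out_rows * layer.out_cols * layer.out_channels
```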
In step 504, N first weights that need to be configured for the first neural network layer and M second weights that need to be configured for the second neural network layer are determined based on a deployment requirement of the neural network system, the first output data volume, and the second output data volume. Herein, N and M are both positive integers, and a ratio of N to M corresponds to a ratio of the first output data volume to the second output data volume. In actual application, the deployment requirement may include a computing delay of the neural network system, or may include a quantity of chips that need to be deployed in the neural network system. A person skilled in the art may know that the neural network operation is mainly a matrix multiply-add operation, and output data of each neural network layer is a one-dimensional or multi-dimensional real number matrix. Therefore, the first output data volume includes a row quantity and a column quantity of the output data of the first neural network layer, and the second output data volume includes a row quantity and a column quantity of the output data of the second neural network layer.
As described above, when performing the neural network operation, for example, performing a convolution operation or a pooling operation, the computing node needs to perform multiply-add computing on input data and a weight of a corresponding neural network layer. Because a weight is configured on a cell in a crossbar, and crossbars in a computing unit perform computing on input data in parallel, a quantity of weights may be used for determining a parallel computing capability of a plurality of computing units for performing the neural network operation. In other words, a computing capability of a computing node for performing the neural network operation is determined based on a quantity of weights configured for a computing unit for performing the neural network operation. In this embodiment, to make the computing capabilities for performing operations of two adjacent neural network layers in the neural network system match, the quantity of weights that need to be configured for the first neural network layer and the quantity of weights that need to be configured for the second neural network layer may be determined based on the specific deployment requirement, the first output data volume, and the second output data volume. Because weights of different neural network layers are not necessarily the same, for a clear description, in this embodiment, a weight required for an operation of the first neural network layer is referred to as a first weight, and a weight required for an operation of the second neural network layer is referred to as a second weight. Performing the operation of the first neural network layer means that the computing node performs, based on the first weight, corresponding computing on data input to the first neural network layer, and performing the operation of the second neural network layer means that the computing node performs, based on the second weight, corresponding computing on data input to the second neural network layer. The computing herein may be performing the neural network operation such as a convolution operation or a pooling operation.
The following describes in detail how to determine a quantity of weights that need to be configured for each neural network layer according to different deployment requirements in this step. The quantity of weights that need to be configured for each neural network layer includes a quantity N of first weights that need to be configured for the first neural network layer and a quantity M of second weights that need to be configured for the second neural network layer. In this embodiment, the weight indicates a weight matrix. The quantity of weights is a quantity of required weight matrices or a quantity of weight replicas. The quantity of weights may also be understood as a quantity of same weight matrices that need to be configured.
In a case, when the deployment requirement of the neural network system is a computing delay of the neural network system, in order that computing of the entire neural network system does not exceed the specified computing delay, the quantity of weights that need to be configured for the first neural network layer may be first determined based on the data output volume of the first neural network layer (that is, the beginning layer of all neural network layers in the neural network system), the computing delay, and a computing frequency of a ReRAM crossbar used in the neural network system; and then, the quantity of weights that need to be configured for each neural network layer may be obtained based on an output data volume of each neural network layer and the quantity of weights that need to be configured for the first neural network layer. Specifically, the quantity of weights that need to be configured for the first neural network layer (that is, the beginning layer) may be obtained according to the following Formula (1):
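Based on the variable definitions in the following paragraph, and assuming that each weight copy produces one output element per computing cycle, one consistent form of Formula (1) is:
nreplicas1=(FMout1·x*FMout1·y)/(t*f), rounded up to an integer  (Formula 1)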
Herein, nreplicas1 is used to indicate the quantity of weights that need to be configured for the first neural network layer (that is, the beginning layer), FMout1·x is the row quantity of the output data of the first neural network layer (that is, the beginning layer), FMout1·y is the column quantity of the output data of the first neural network layer (that is, the beginning layer), t is the specified computing delay, and f is the computing frequency of the crossbar in the computing unit. A person skilled in the art may learn that a value of f may be obtained based on a configuration parameter of the used crossbar. The data volume of the output data of the first neural network layer may be obtained based on the network model information obtained in step 502. It may be understood that, when the first neural network layer is the beginning layer of all the neural network layers in the neural network system, the quantity N of first weights is a value of nreplicas1 calculated according to Formula 1.
In this embodiment, after the quantity of weights required for the beginning neural network layer is obtained, a ratio of quantities of weights required for two adjacent layers may correspond to a ratio of output data volumes of the two adjacent layers, to improve data processing efficiency in the neural network system, avoid a bottleneck or data waiting in a parallel pipeline processing manner, and make processing rates of the two adjacent neural network layers match. For example, the two ratios may be the same. Therefore, in this embodiment, the quantity of weights required for each neural network layer may be determined based on the quantity of weights required for the beginning neural network layer and a ratio of output data volumes of two adjacent neural network layers. Specifically, the quantity of weights required for each neural network layer may be calculated according to the following Formula (2).
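Based on the ratio relationship described above and the variable definitions in the following paragraph, Formula (2) can be expressed as:
nreplicasi/nreplicasi-1=(FMouti·x*FMouti·y)/(FMouti-1·x*FMouti-1·y)  (Formula 2)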
Herein, nreplicasi is used to indicate a quantity of weights required for an ith layer, nreplicasi-1 is used to indicate a quantity of weights required for an (i−1)th layer, FMouti·x is used to indicate a row quantity of output data of the ith layer, FMouti·y is used to indicate a column quantity of the output data of the ith layer, FMouti-1·x is used to indicate a row quantity of output data of the (i−1)th layer, FMouti-1·y is used to indicate a column quantity of the output data of the (i−1)th layer, a value of i may be from 2 to n, and n is a total layer quantity of the neural network layers in the neural network system. In another expression, in this embodiment, a ratio of the quantity of weights required for an operation of the (i−1)th neural network layer to the quantity of weights required for an operation of the ith neural network layer corresponds to a ratio of the output data volume of the (i−1)th layer to the output data volume of the ith layer.
A person skilled in the art may learn that the output data of each neural network layer may include a plurality of channels. Herein, the channel indicates a quantity of kernels at each neural network layer. One kernel represents a feature extraction manner, and corresponds to one feature map. A plurality of feature maps form the output data of this layer. A weight used for one neural network layer includes a plurality of kernels. Therefore, in actual application, in another case, a quantity of channels of each neural network layer may be considered for calculation of the output data volume of each layer. Specifically, after the quantity of weights required for the first neural network layer is obtained according to the foregoing Formula 1, the quantity of weights required for each neural network layer may be obtained according to the following Formula 3.
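Extending Formula 2 with the channel quantities defined in the following paragraph, one consistent form of Formula 3 is:
nreplicasi/nreplicasi-1=(FMouti·x*FMouti·y*Ci)/(FMouti-1·x*FMouti-1·y*Ci-1)  (Formula 3)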
A difference between Formula 3 and Formula 2 is that a quantity of channels output by each neural network layer is further considered in Formula 3 based on Formula 2. Herein, Ci-1 is used to represent a quantity of channels of the (i−1)th layer, Ci is used to represent a quantity of channels of the ith layer, a value of i is from 2 to n, n is the total layer quantity of the neural network layers in the neural network system, and n is an integer not less than 2. The quantity of channels of each neural network layer may be obtained from the network model information.
In this embodiment, after the quantity of weights required for the beginning layer is obtained according to the foregoing Formula 1, the quantity of weights required for each neural network layer may be calculated based on Formula 2 (or Formula 3) and the output data volume that is of each neural network layer and that is included in the network model information. For example, when the first neural network layer is the beginning layer of all the neural network layers in the neural network system, after the quantity N of first weights is obtained according to Formula 1, the quantity M of second weights required for the second neural network layer may be obtained according to Formula 2 based on the value of N, the specified first output data volume, and the specified second output data volume. In another expression, after the value of N is obtained, the value of M may be calculated according to the following formula: N/M=the first output data volume/the second output data volume.
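The following sketch illustrates, under the stated assumptions, how Formula 1 and Formula 2 could be evaluated in sequence; the function name, the numeric example, and the assumption of one output element per weight copy per cycle are illustrative rather than part of this embodiment.

```python
import math

def replicas_for_delay(layer_out_shapes, delay_s, crossbar_freq_hz):
    """Illustrative sketch: per-layer weight quantities under a computing-delay budget.

    layer_out_shapes: list of (rows, cols) output sizes, beginning layer first.
    Returns [nreplicas_1, ..., nreplicas_n].
    """
    rows1, cols1 = layer_out_shapes[0]
    # Formula 1: the beginning layer must emit rows1*cols1 output elements within
    # the delay budget, assuming one output element per weight copy per cycle.
    replicas = [math.ceil(rows1 * cols1 / (delay_s * crossbar_freq_hz))]
    for (prev_r, prev_c), (r, c) in zip(layer_out_shapes, layer_out_shapes[1:]):
        # Formula 2: keep the weight-quantity ratio of adjacent layers equal to
        # their output-data-volume ratio so that processing rates match.
        replicas.append(max(1, math.ceil(replicas[-1] * (r * c) / (prev_r * prev_c))))
    return replicas

# Hypothetical example: three layers with shrinking feature maps,
# a 100-microsecond delay budget, and 10 MHz crossbars.
print(replicas_for_delay([(224, 224), (112, 112), (56, 56)], 1e-4, 10e6))  # [51, 13, 4]
```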
In another case, when the deployment requirement is a quantity of chips required for the neural network system, the quantity of weights required for the first neural network layer may be calculated with reference to the foregoing Formula 2 and the following Formula 4, or the quantity of weights required for the first neural network layer may be calculated with reference to the foregoing Formula 3 and the following Formula 4.
xb1*nreplicas1+xb2*nreplicas2+ . . . +xbn*nreplicasn≤K*L  (Formula 4)
In the foregoing Formula 4, xb1 is used to indicate a quantity of crossbars that are required for deploying a weight of the first neural network layer (or referred to as the beginning layer), nreplicas1 is used to indicate the quantity of weights required for the beginning layer, xb2 is used to indicate a quantity of crossbars that are required for deploying a weight of the second neural network layer, nreplicas2 is used to indicate a quantity of weights required for the second neural network layer, xbn is used to indicate a quantity of crossbars that are required for deploying a weight of an nth neural network layer, nreplicasn is used to indicate a quantity of weights required for the nth neural network layer, K is a quantity of chips that are required in the deployment requirement for the neural network system, and L is a quantity of crossbars in each chip. A sum of quantities of crossbars of the neural network layers in Formula 4 is less than or equal to a specified total quantity of crossbars included in the chips required in the neural network system. For descriptions of Formula 2 and Formula 3, refer to the foregoing descriptions. Details are not described herein again.
A person skilled in the art may know that after a model of the neural network system is determined, a weight of each neural network layer in the neural network system and a specification of a crossbar (that is, a row quantity and a column quantity of ReRAM cells in the crossbar) used in the neural network system are determined. In another expression, the network model information of the neural network system further includes a size of the weight used for each neural network layer and specification information of the crossbar. Therefore, in this embodiment, xbi of the ith neural network layer may be obtained based on the size of the weight (that is, a row quantity and a column quantity of the weight matrix) of each layer and the specification of the crossbar, where a value of i is from 2 to n. A value of L may be obtained based on a parameter of a chip used in the neural network system. In this embodiment, in a case, after the quantity (that is, nreplicas1) of weights required for the beginning neural network layer is obtained according to the foregoing Formula 2 and Formula 4, the quantity of weights that need to be configured for each layer may be obtained based on Formula 2 and the output data volume that is of each layer and that is obtained from the network model information. In another case, after the quantity (that is, nreplicas1) of weights required for the beginning neural network layer is obtained according to the foregoing Formula 3 and Formula 4, the quantity of weights that need to be configured for each layer may be obtained based on Formula 3 and the output data volume of each layer.
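As an illustration only, the following sketch shows one way the constraint of Formula 4 could be combined with the ratio of Formula 2 (or Formula 3) to find the largest beginning-layer weight quantity that fits the chip budget; the search procedure, function name, and numeric example are assumptions, not a required implementation.

```python
import math

def replicas_for_chip_budget(xb_per_weight, out_volumes, num_chips, crossbars_per_chip):
    """Illustrative sketch: weight quantities under a chip-count deployment requirement.

    xb_per_weight[i]: crossbars needed to deploy one weight copy of layer i+1 (xb_i).
    out_volumes[i]:   output data volume of layer i+1 (rows*cols, optionally *channels).
    """
    budget = num_chips * crossbars_per_chip                 # right-hand side of Formula 4
    ratios = [v / out_volumes[0] for v in out_volumes]      # Formula 2/3 ratios vs. layer 1

    def crossbars_used(n1):
        # Left-hand side of Formula 4 when the beginning layer has n1 weight copies.
        return sum(xb * math.ceil(n1 * r) for xb, r in zip(xb_per_weight, ratios))

    if crossbars_used(1) > budget:
        return None                                         # even one copy per layer does not fit
    n1 = 1
    while crossbars_used(n1 + 1) <= budget:                 # grow n1 while Formula 4 still holds
        n1 += 1
    return [math.ceil(n1 * r) for r in ratios]

# Hypothetical example: three layers needing 4, 8, and 16 crossbars per weight copy,
# output volumes 50176, 12544, and 3136, deployed on 2 chips of 96 crossbars each.
print(replicas_for_chip_budget([4, 8, 16], [50176, 12544, 3136], 2, 96))  # [26, 7, 2]
```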
In step 506, the N first weights are deployed for P computing units, and the M second weights are deployed for Q computing units based on a computing specification of a computing unit in the neural network system. Herein, P and Q are both positive integers, the P computing units are configured to perform the operation of the first neural network layer, and the Q computing units are configured to perform the operation of the second neural network layer. In this embodiment, the computing specification of the computing unit indicates a quantity of crossbars included in the computing unit. In actual application, one computing unit may include one or more crossbars. Specifically, as described above, because the network model information of the neural network system further includes the size of the weight used for each neural network layer and the specification information of the crossbar, a deployment relationship between a weight and a crossbar may be obtained. After the quantity of weights that need to be configured for each neural network layer is obtained in step 504, the weight of each layer may be deployed for a corresponding quantity of computing units based on a quantity of crossbars included in each computing unit. Specifically, an element in a weight matrix is deployed for a ReRAM cell in a computing unit. In this embodiment, the computing unit may be a PE or an engine. One PE may include a plurality of engines, and one engine may include one or more crossbars. Because each layer may have different sizes of weights, one weight may be deployed for one or more engines.
Specifically, in this step, the P computing units for which the N first weights need to be deployed and the Q computing units for which the M second weights need to be deployed may be determined based on the deployment relationship between a weight and a crossbar and the quantity of crossbars included in the computing unit. For example, the N first weights of the first neural network layer may be deployed for the P computing units, and the M second weights may be deployed for the Q computing units. Specifically, an element in the N first weights is deployed for a ReRAM cell corresponding to a crossbar in the P computing units, and an element in the M second weights is deployed for a ReRAM cell corresponding to a crossbar in the Q computing units. Therefore, the P computing units may perform, based on the N configured first weights, the operation of the first neural network layer on input data input to the P computing units, and the Q computing units may perform, based on the M configured second weights, the operation of the second neural network layer on input data input to the Q computing units.
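The mapping from weight quantities to quantities of computing units depends on how many crossbars one computing unit contains; the following sketch illustrates one simplified way P and Q could be derived, assuming each weight copy occupies whole engines. The function name and numbers are hypothetical.

```python
import math

def units_for_layer(n_weights, crossbars_per_weight, crossbars_per_unit):
    """Illustrative sketch: computing units occupied by one neural network layer.

    Simplification: each weight copy is deployed on its own group of
    ceil(crossbars_per_weight / crossbars_per_unit) computing units.
    """
    units_per_copy = math.ceil(crossbars_per_weight / crossbars_per_unit)
    return n_weights * units_per_copy

# Hypothetical example: N = 26 first weights of 4 crossbars each and M = 7 second
# weights of 8 crossbars each, with 2 crossbars per engine.
P = units_for_layer(26, 4, 2)   # computing units for the first neural network layer
Q = units_for_layer(7, 8, 2)    # computing units for the second neural network layer
print(P, Q)                     # 52 28
```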
It may be learned from the foregoing embodiment that, in the resource allocation method provided in this embodiment, when a computing unit for performing an operation of each neural network layer is configured, a data volume output by an adjacent neural network layer is considered, so that computing capabilities of computing nodes for performing operations of different neural network layers match, thereby fully using computing capabilities of the computing nodes and improving data processing efficiency.
Further, in this embodiment, to further reduce a data transmission volume between the computing units of different neural network layers and save transmission bandwidth between computing units or computing nodes, the computing unit may be mapped to an upper-level computing node of the computing unit by using the following method. As described above, the neural network system may include four levels of computing nodes: a level 1 computing node: a chip, a level 2 computing node: a tile, a level 3 computing node: a PE, and a level 4 computing node: an engine.
In step 602, network model information of a neural network system is obtained. The network model information includes a first output data volume of a first neural network layer and a second output data volume of a second neural network layer in the neural network system. In step 604, N first weights that need to be configured for the first neural network layer and M second weights that need to be configured for the second neural network layer are determined based on a deployment requirement of the neural network system, the first output data volume, and the second output data volume. In step 606, P computing units for which the N first weights need to be deployed and Q computing units for which the M second weights need to be deployed are determined based on a computing specification of a computing unit in the neural network system. For steps 602, 604, and 606 in this embodiment, refer to related descriptions of the foregoing steps 502, 504, and 506. A difference between step 606 and step 506 is that, in step 606, after the P computing units for which the N first weights need to be deployed and the Q computing units for which the M second weights need to be deployed are determined, the N first weights are not directly deployed for the P computing units, nor are the M second weights directly deployed for the Q computing units. Instead, step 608 is performed.
In step 608, the P computing units and the Q computing units are mapped to a plurality of level 3 computing nodes based on a quantity of computing units included in a level 3 computing node in the neural network system. Specifically,
In step 6082, the P computing units and the Q computing units are divided into m groups. Each group includes P/m computing units for execution at the first neural network layer and Q/m computing units for execution at the second neural network layer. Herein, m is an integer not less than 2, and values of P/m and Q/m are both integers. Specifically, for example, the P computing units are computing units for execution at an (i−1)th layer, and the Q computing units are computing units for execution at an ith layer. As shown in
In step 6084, each group of computing units is separately mapped to a level 3 computing node based on a quantity of computing units included in a level 3 computing node. In a mapping process, computing units for performing operations of adjacent neural network layers are mapped, as much as possible, to a same level 3 node. As shown in
In this mapping manner, computing units for execution at adjacent neural network layers (for example, the ith layer and the (i+1)th layer in
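To illustrate steps 6082 and 6084 only, the following sketch shows one possible grouping and packing of the two layers' computing units into level 3 nodes so that units of adjacent neural network layers tend to share a node; the names and the packing heuristic are assumptions.

```python
def group_and_map(p_units, q_units, m, units_per_level3_node):
    """Illustrative sketch of steps 6082/6084: divide the P and Q computing units
    into m groups, then pack each group into level 3 computing nodes so that units
    of adjacent neural network layers share a node as much as possible."""
    assert len(p_units) % m == 0 and len(q_units) % m == 0
    p_per_group, q_per_group = len(p_units) // m, len(q_units) // m
    nodes = []                                   # each level 3 node is a list of units
    for g in range(m):
        group = (p_units[g * p_per_group:(g + 1) * p_per_group] +
                 q_units[g * q_per_group:(g + 1) * q_per_group])
        nodes += [group[i:i + units_per_level3_node]
                  for i in range(0, len(group), units_per_level3_node)]
    return nodes

# Hypothetical example: 4 units of the (i-1)th layer, 2 units of the ith layer,
# 2 groups, and level 3 nodes holding 3 computing units each.
print(group_and_map(['p0', 'p1', 'p2', 'p3'], ['q0', 'q1'], 2, 3))
# [['p0', 'p1', 'q0'], ['p2', 'p3', 'q1']]
```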
Back to
In step 614, the N first weights and the M second weights are separately deployed for the P computing units and the Q computing units corresponding to the plurality of level 3 nodes, the plurality of level 2 computing nodes, and the plurality of level 1 computing nodes. In this embodiment, a mapping relationship from the level 1 computing node: a chip, to the level 4 computing node: an engine, in the neural network system may be obtained by using the method shown in
In this deployment manner, not only computing capabilities of computing units that support operations of adjacent neural network layers in the neural network system described in this embodiment can match, but also the computing units for performing the operations of the adjacent neural network layers can be located in a same level 3 computing node as many as possible. In addition, level 3 computing nodes for performing the operations of the adjacent neural network layers are located in a same level 2 computing node as many as possible, and level 2 computing nodes for performing the operations of the adjacent neural network layers are located in a same level 1 computing node as many as possible, thereby reducing a data volume transmitted between computing nodes and improving a rate of data transmission between different neural network layers.
It should be noted that, in this embodiment, in a neural network system including four levels of computing nodes, the case in which a level 4 computing node, that is, an engine, is used as a computing unit is used to describe a process of allocating computing resources used for performing operations of different neural network layers. In another expression, in the foregoing embodiment, an engine is used as a division granularity of sets for performing the operations of different neural network layers. In actual application, allocation may be alternatively performed by using the level 3 computing node, that is, a PE, as a computing unit. In this case, mapping between the level 3 computing node: a PE, the level 2 computing node: a tile, and the level 1 computing node: a chip may be established by using the foregoing method. Certainly, when a very large data volume needs to be computed, the allocation may also be performed by using the level 2 computing node: a tile as a granularity. In another expression, in this embodiment, the computing unit may be an engine, a PE, a tile, or a chip. This is not limited herein.
The foregoing describes in detail how to configure computing resources in the neural network system provided in the embodiments. The following further describes the neural network system from a perspective of data processing.
In step 802, P computing units in the neural network system receive first input data. The P computing units are configured to perform an operation of a first neural network layer in the neural network system. In this embodiment, the first neural network layer is any layer in the neural network system. The first input data is data on which the operation of the first neural network layer needs to be performed. When the first neural network layer is the first layer 302 in the neural network system shown in
In step 804, the P computing units perform computing on the first input data based on N configured first weights, to obtain first output data. In this embodiment, the first weight is a weight matrix. The N first weights indicate N weight matrices, and the N first weights may also be referred to as N first weight replicas. The N first weights may be configured for the P computing units by using the methods shown in
In step 806, Q computing units in the neural network system receive second input data. The Q computing units are configured to perform an operation of a second neural network layer in the neural network system. The second input data includes the first output data. Specifically, in a case, the Q computing units may perform the operation of the second neural network layer only on the first output data of the P computing units. For example, the P computing units are configured to perform the operation of the first layer 302 shown in
In step 808, the Q computing units perform computing on the second input data based on M configured second weights, to obtain second output data. In this embodiment, the second weight is also a weight matrix. The M second weights indicate M weight matrices, and the M second weights may also be referred to as M second weight replicas. Similar to step 804, according to the method shown in
For a clear description, the following briefly describes how the ReRAM crossbar implements a matrix multiply-add operation. As shown in
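As a plain numeric illustration of the principle (not of the structure in the figure), the multiply-add behavior of a ReRAM crossbar can be imitated as follows: each cell's conductance stores one weight element, input data are applied as voltages on the rows, and each column accumulates a current equal to a weighted sum. The function and numbers below are assumptions for illustration only.

```python
import numpy as np

def crossbar_mac(conductances, input_voltages):
    """Illustrative sketch of the multiply-add operation a ReRAM crossbar performs.

    conductances: 2-D array; each ReRAM cell's conductance encodes one weight element.
    input_voltages: 1-D array applied to the rows (one element of input data each).
    Each column accumulates a current I_j = sum_i V_i * G_ij, i.e. a weighted sum.
    """
    return np.asarray(input_voltages) @ np.asarray(conductances)

# Hypothetical example: a 3x2 weight matrix mapped to conductances,
# multiplied by a 3-element input vector in one step.
weights = np.array([[0.1, 0.4],
                    [0.2, 0.5],
                    [0.3, 0.6]])
x = np.array([1.0, 2.0, 3.0])
print(crossbar_mac(weights, x))   # [1.4 3.2]
```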
As described above, because a data volume output by an adjacent neural network layer is considered when a computing unit for performing an operation of each neural network layer is configured in the neural network system, computing capabilities of computing nodes for performing operations of different neural network layers match. Therefore, in the data processing method according to this embodiment, the computing capability of the computing unit can be fully used, thereby improving data processing efficiency of the neural network system.
In another case, an embodiment provides a resource allocation apparatus. The apparatus may be applied to the neural network systems shown in
The obtaining module 1102 is configured to obtain a data volume of first output data of a first neural network layer and a data volume of second output data of a second neural network layer in a neural network system. Input data of the second neural network layer includes the first output data. The computing module 1104 is configured to determine, based on a deployment requirement of the neural network system, N first weights that need to be configured for the first neural network layer and M second weights that need to be configured for the second neural network layer. Herein, N and M are both positive integers, and a ratio of N to M corresponds to a ratio of the data volume of the first output data to the data volume of the second output data.
As described above, the neural network system in this embodiment includes a plurality of neural network chips, each neural network chip includes a plurality of computing units, and each computing unit includes at least one ReRAM crossbar. In a case, the deployment requirement includes a computing delay. When the first neural network layer is a beginning layer of all neural network layers in the neural network system, the computing module is configured to: determine a value of N based on the data volume of the first output data, the computing delay, and a computing frequency of a ReRAM crossbar in the computing unit; and determine a value of M based on the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
In another case, the deployment requirement includes a quantity of chips in the neural network system. The first neural network layer is a beginning layer in the neural network system. The computing module is configured to: determine the value of N based on the quantity of chips, a quantity of ReRAM crossbars in each chip, a quantity of ReRAM crossbars that are required for deploying a weight of each neural network layer, and a data volume ratio of output data of adjacent neural network layers; and determine the value of M based on the value of N and the ratio of the data volume of the first output data to the data volume of the second output data.
The deployment module 1106 is configured to: based on a computing specification of a computing unit in the neural network system, deploy N first weights for P computing units, and deploy M second weights for Q computing units. Herein, both P and Q are positive integers. The P computing units are configured to perform an operation of the first neural network layer, and the Q computing units are configured to perform an operation of the second neural network layer. Herein, the computing specification of the computing unit indicates a quantity of crossbars included in the computing unit. In actual application, one computing unit may include one or more crossbars. Specifically, after the computing module 1104 obtains a quantity of weights that need to be configured for each neural network layer, the deployment module 1106 may deploy the weight of each layer for a corresponding computing unit based on a quantity of crossbars included in each computing unit. Specifically, each element in a weight matrix is deployed for a ReRAM cell of a crossbar in a computing unit. In this embodiment, the computing unit may be a PE or an engine. One PE may include a plurality of engines, and one engine may include one or more crossbars. Because each layer may have different sizes of weights, one weight may be deployed for one or more engines.
As described above, the neural network system shown in
Further, the mapping module 1108 is further configured to map the plurality of level 2 computing nodes to which the P computing units and the Q computing units are mapped, to a plurality of neural network chips based on a quantity of level 2 computing nodes included in each neural network chip. Herein, at least some of level 2 computing nodes to which the P computing units belong and at least some of level 2 computing nodes to which the Q computing units belong are mapped to a same neural network chip.
In this embodiment, refer to corresponding descriptions in
An embodiment further provides a computer program product for implementing the foregoing resource allocation method. In addition, an embodiment provides a computer program product for implementing the foregoing data processing method. Each of the computer program products includes a computer-readable storage medium that stores program code. An instruction included in the program code is used for executing the method procedure described in any one of the foregoing method embodiments. A person of ordinary skill in the art may understand that the foregoing storage medium may include any non-transitory machine-readable medium capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random-access memory (RAM), a solid-state drive (SSD), or a non-volatile memory.
It should be noted that the embodiments provided in this application are merely examples. A person skilled in the art may clearly know that, for convenience and conciseness of description, in the foregoing embodiments, the embodiments emphasize different aspects, and for a part not described in detail in one embodiment, refer to relevant description of another embodiment. The features disclosed in the embodiments, claims, and the accompanying drawings may exist independently or exist in a combination. Features described in a hardware form in the embodiments may be executed by software, and vice versa. This is not limited herein.
This is a continuation of Int'l Patent App. No. PCT/CN2018/125761, filed on Dec. 29, 2018, which is incorporated by reference.
Parent application: PCT/CN2018/125761, filed December 2018. Child application: U.S. application No. 17360459.