This application relates to the field of artificial intelligence technologies, and in particular, to a data processing method and apparatus.
In recent years, the development of deep learning has led to research on a deep learning-based wireless communication technology in academia and industry. A research result proves that a deep learning technology can improve performance of a wireless communication system, and has potential to be applied to a physical layer to perform interference adjustment, channel estimation, signal detection, signal processing, and other aspects.
A conventional communication transceiver design may be replaced by an autoencoder, a transmit end and a receive end are modeled using a neural network, data distribution is learned by using a large quantity of training samples, and a result is predicted. For example, a neural network may be trained according to a back propagation (back propagation, BP) algorithm. A learning process of the BP algorithm includes a forward propagation process and a back propagation process. In the forward propagation process, input information is processed by an input layer and a hidden layer in sequence and then is sent to an output layer to obtain an excitation response. In the back propagation process, a difference between the excitation response and a corresponding expected target output is calculated as an objective function, and partial derivatives of the objective function with respect to weights of neurons are calculated layer by layer, to form a gradient of the objective function with respect to a weight vector, so that the weight can be modified. Learning of the neural network is completed in a weight modification process. When an error reaches an expected value, the learning of the neural network ends. However, in the BP algorithm, there is no corresponding theoretical guidance for selecting a quantity of network layers and a quantity of neurons, and when a network structure is modified, retraining needs to be performed. There is no reliable mathematical interpretability for a network output result. The implementation of the neural network is considered as a “black box”, which cannot be widely recognized in theory. In addition, gradient disappearance or gradient explosion caused by execution of the BP algorithm has not been effectively resolved.
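As an illustration of the weight modification described above (this is the generic gradient-descent form of BP, not a formula specific to this application), each weight w(k) is adjusted against the partial derivative of the objective function E, obtained layer by layer through the chain rule:

```latex
w^{(k)} \leftarrow w^{(k)} - \eta\,\frac{\partial E}{\partial w^{(k)}},
\qquad
\frac{\partial E}{\partial w^{(k)}}
= \frac{\partial E}{\partial a^{(k)}}\cdot\frac{\partial a^{(k)}}{\partial w^{(k)}},
```

where η is a learning rate and a(k) is the activation of the kth layer.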
Embodiments of this application disclose a data processing method and apparatus, so that communication overheads can be reduced, a feedforward neural network architecture is more flexible, and a black box problem of a neural network can be interpreted.
A first aspect of embodiments of this application discloses a data processing method, including: determining a feedforward neural network model, where input information of an lth layer in the feedforward neural network model includes category distribution information of training data and a first data feature, output information of the lth layer includes a second data feature, the first data feature is an output of an (l-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and l is a positive integer greater than 1; obtaining to-be-processed data of unknown classification or clustering information; and inputting the to-be-processed data into the feedforward neural network model to determine a data feature of the to-be-processed data, where the data feature of the to-be-processed data is the classification or clustering information representing the to-be-processed data, and the data feature of the to-be-processed data is used to determine a classification or clustering result of the to-be-processed data.
In the foregoing method, compared with a BP algorithm in which a transmit-end network needs to be updated through gradient backhaul, the method in embodiments of this application can reduce communication overheads caused by training and interaction, and improve training efficiency. A receive end needs to train only a task-related readout layer network. In addition, a structure of the feedforward neural network is more flexible, and accuracy can be improved by increasing a quantity of network layers. In other words, when a value of l is larger, accuracy of the classification or clustering result of the to-be-processed data is higher, thereby avoiding a problem that retraining is needed due to different adaptations to different transmission/receiving networks. In addition, the feedforward neural network model is interpretable, and a black box problem of a neural network can be interpreted. In addition, the output data feature of the to-be-processed data may be used as data preprocessing, and can be used for a subsequent readout layer operation.
In a possible implementation, a dimension of the data feature of the to-be-processed data is related to a data type of the to-be-processed data.
In another possible implementation, when l=2, and the first data feature is an output of a 1st layer, input information of the 1st layer includes the category distribution information of the training data and the training data, where the training data includes category labels, and the category distribution information of the training data is determined based on the category labels in the training data.
In another possible implementation, the determining a feedforward neural network model includes: obtaining the first data feature Zl-1; and determining network parameters of the lth layer based on the first data feature Zl-1 and the category distribution information Πi of the training data, where the second data feature is determined based on the first data feature Zl-1 and the network parameters of the lth layer.
In another possible implementation, that the second data feature is determined based on the first data feature Zl-1 and the network parameters of the lth layer includes: determining an objective function gradient expression based on the network parameters of the lth layer and the first data feature Zl-1; and determining the second data feature Zl based on the first data feature Zl-1, the category distribution information Πi of the training data, and the objective function gradient expression.
In another possible implementation, the determining network parameters of the lth layer based on the first data feature Zl-1 and the category distribution information Πi of the training data includes: determining, based on the first data feature Zl-1 and the category distribution information Πi of the training data, a regularized autocorrelation matrix of data whose category label corresponds to each category in the training data; and determining the network parameters of the lth layer based on the regularized autocorrelation matrix of the data whose category label corresponds to each category in the training data.
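As a rough illustration of this implementation, the following Python sketch computes a per-category regularized autocorrelation matrix from the first data feature Zl-1 and the category distribution information Πi, and derives per-category layer parameters from it. The specific forms used below (Ŝi = Zl-1Πi(Zl-1)T/mi + ϵI and Uil = Ŝi⁻¹) are assumptions made only for illustration; they are not the expressions claimed in this application.

```python
import numpy as np

def layer_parameters(Z_prev, Pi_list, eps=0.1):
    """Hypothetical construction of the lth-layer parameters, one matrix per category.

    Z_prev : d x m matrix, the first data feature Z^{l-1}.
    Pi_list: list of K diagonal m x m membership matrices (category distribution
             information of the training data).
    eps    : regularization parameter.
    """
    d = Z_prev.shape[0]
    U = []
    for Pi_i in Pi_list:
        m_i = np.trace(Pi_i)                      # number of samples in category i
        S_i = Z_prev @ Pi_i @ Z_prev.T / m_i      # per-category autocorrelation matrix
        S_hat_i = S_i + eps * np.eye(d)           # regularized autocorrelation matrix
        U.append(np.linalg.inv(S_hat_i))          # assumed form of the layer parameters
    return U
```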
In another possible implementation,
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data,
αi is a weight parameter used to balance quantities of samples of the categories in the training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, and Uil are network parameters of the ith category of the lth layer.
In another possible implementation,
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, I is an identity matrix, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, S is autocorrelation matrixes of data whose category labels correspond to all the categories in the training data, Ŝ is regularized autocorrelation matrixes of the data whose category labels correspond to all the categories in the training data, and Ail are network parameters of the ith category of the lth layer.
In another possible implementation, the determining network parameters of the lth layer based on the first data feature Zl-1 and the category distribution information Πi of the training data includes: determining gradient parameters based on the category distribution information Πi of the training data; and determining the network parameters of the lth layer based on the first data feature Zl-1 and the gradient parameters.
In another possible implementation,
where
Zl-1 satisfies an energy constraint: Tr(Zl-1 (Zl-1)T)=m(1+σ²d), σ is a Gaussian distribution variance, m is a quantity of samples of the training data, d is a dimension of the training data, Zl-1 is the first data feature, e∈Rm×1 is a column vector whose elements are all 1, Πi is the category distribution information of the training data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data whose category label corresponds to an ith category in the m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Cil are network parameters of the ith category of the lth layer, and G and Hi are the gradient parameters.
In another possible implementation, the inputting the to-be-processed data into the feedforward neural network model to determine a data feature of the to-be-processed data includes: determining, based on the to-be-processed data and the network parameters of the lth layer, category distribution information Πil that corresponds to predicted category labels and that is of the to-be-processed data; determining an objective function gradient expression based on the to-be-processed data and the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determining the data feature of the to-be-processed data based on the to-be-processed data and the objective function gradient expression.
In another possible implementation, the determining, based on the to-be-processed data and the network parameters of the lth layer, category distribution information that corresponds to predicted category labels and that is of the to-be-processed data includes: determining projections of the predicted category labels in the to-be-processed data on a first category based on the to-be-processed data and the network parameters of the lth layer, where the first category is any one of a plurality of categories corresponding to the predicted category labels in the to-be-processed data; and determining, based on the projections of the predicted category labels in the to-be-processed data on the first category, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data.
In another possible implementation,
pil=UilZ; and
where
Z is the to-be-processed data, Uil are network parameters of an ith category of the lth layer, pil represents projections of the predicted category labels in the to-be-processed data on the ith category of the lth layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
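A soft-assignment form consistent with η being a confidence hyperparameter, given here purely as an assumption rather than the expression claimed in this application, normalizes the projection magnitudes of each sample j over the K categories:

```latex
\hat{\pi}_i^{\,l}(j)=
\frac{\exp\!\left(-\eta\,\lVert p_i^{\,l}(:,j)\rVert\right)}
     {\sum_{k=1}^{K}\exp\!\left(-\eta\,\lVert p_k^{\,l}(:,j)\rVert\right)},
\qquad
\Pi_i^{\,l}=\mathrm{diag}\!\left(\hat{\pi}_i^{\,l}(1),\ldots,\hat{\pi}_i^{\,l}(m)\right),
```

where p_i^l(:,j) denotes the jth column of p_i^l=U_i^lZ.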
In another possible implementation, the objective function gradient expression includes:
where
mi is a quantity of pieces of data whose predicted category label is the ith category in m pieces of to-be-processed data, m=ΣiK mi, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data,
αi is a weight parameter used to balance quantities of samples of the predicted categories in the to-be-processed data, Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Si is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, and Ŝi is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data.
In another possible implementation,
pil=AilZ; and
where
Z is the to-be-processed data, Ail are network parameters of an ith category of the lth layer, pil represents projections of the predicted category labels in the to-be-processed data on the ith category of the lth layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
In another possible implementation, the objective function gradient expression includes:
where
mi is a quantity of pieces of data whose predicted category label is an ith category in m pieces of to-be-processed data,
αi is a weight parameter used to balance quantities of samples of predicted categories in the to-be-processed data, Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Si is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, Ŝi is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, S is autocorrelation matrixes of data whose predicted category labels correspond to all the categories in the to-be-processed data, and Ŝ is regularized autocorrelation matrixes of the data whose predicted category labels correspond to all the categories in the to-be-processed data.
In another possible implementation, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data includes one or more of the following: distance information, correlation information, differential information, or soft classification information.
In another possible implementation, the determining, based on the to-be-processed data and the network parameters of the lth layer, category distribution information that corresponds to predicted category labels and that is of the to-be-processed data includes:
Πil=argmin dist(Z, Cil); or
Πil=argmin <Z, Cil>; or
where
Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Cil are network parameters of an ith category of the lth layer, Zl is a data feature of the to-be-processed data at the lth layer, Zl-1 is a data feature of the to-be-processed data at the (l-1)th layer, and <> represents an inner product.
In another possible implementation, the determining an objective function gradient expression based on the to-be-processed data and the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data includes: determining gradient parameters (G and Hi) based on the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determining the objective function gradient expression based on the to-be-processed data and the gradient parameters.
In another possible implementation, the determining gradient parameters based on the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data includes:
G=[g1, g2, . . . , gi]; and
where
Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data of an ith category in m pieces of to-be-processed data, m=ΣiK mi, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data, and G and Hi represent the gradient parameters.
In another possible implementation, the objective function gradient expression includes:
where
Z is the to-be-processed data, σ is a Gaussian distribution variance, ϵ is a regularization parameter, I is an identity matrix, G and Hi represent the gradient parameters, and β represents a regularization parameter.
In another possible implementation, the determining the data feature of the to-be-processed data based on the to-be-processed data and the objective function gradient expression includes:
where
Zl is the data feature of the to-be-processed data, ∂L/∂Z is the objective function gradient expression, Zl-1 is the to-be-processed data, and Zl-1 is constrained in (d-1)-dimensional unit sphere space.
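Given that the feature is constrained to the unit sphere, a natural projected gradient-ascent step, written here only as an illustrative assumption with step size γ, is:

```latex
\tilde{Z}=Z^{\,l-1}+\gamma\,\frac{\partial L}{\partial Z},
\qquad
Z^{\,l}(:,j)=\frac{\tilde{Z}(:,j)}{\lVert \tilde{Z}(:,j)\rVert},\quad j=1,\ldots,m,
```

that is, each column (sample) is moved along the objective function gradient and then renormalized onto the (d-1)-dimensional unit sphere.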
In another possible implementation, the method further includes: outputting the data feature of the to-be-processed data.
In another possible implementation, the to-be-processed data of the unknown classification or clustering information is a data feature of third data, the data feature of the third data is determined through another feedforward neural network, input information of an lth layer in the another feedforward neural network includes category distribution information of training data and a first data feature, output information of the lth layer includes a second data feature, the first data feature is an output of an (l-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and l is a positive integer greater than 1.
A second aspect of embodiments of this application discloses a data processing apparatus, including: a first determining unit, an obtaining unit, and a second determining unit. The first determining unit is configured to determine a feedforward neural network model, where input information of an lth layer in the feedforward neural network model includes category distribution information of training data and a first data feature, output information of the lth layer includes a second data feature, the first data feature is an output of an (l-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and l is a positive integer greater than 1. The obtaining unit is configured to obtain to-be-processed data of unknown classification or clustering information. The second determining unit is configured to input the to-be-processed data into the feedforward neural network model to determine a data feature of the to-be-processed data, where the data feature of the to-be-processed data is the classification or clustering information representing the to-be-processed data, and the data feature of the to-be-processed data is used to determine a classification or clustering result of the to-be-processed data.
In a possible implementation, a dimension of the data feature of the to-be-processed data is related to a data type of the to-be-processed data.
In another possible implementation, when l=2, and the first data feature is an output of a 1st layer, input information of the 1st layer includes the category distribution information of the training data and the training data, where the training data includes category labels, and the category distribution information of the training data is determined based on the category labels in the training data.
In another possible implementation, the first determining unit is specifically configured to obtain the first data feature Zl-1; and determine network parameters of the lth layer based on the first data feature Zl-1 and the category distribution information Πi of the training data, where the second data feature is determined based on the first data feature Zl-1 and the network parameters of the lth layer.
In another possible implementation, the first determining unit is specifically configured to determine an objective function gradient expression based on the network parameters of the lth layer and the first data feature Zl-1; and determine the second data feature Zl based on the first data feature Zl-1, the category distribution information Πi of the training data, and the objective function gradient expression.
In another possible implementation, the first determining unit is specifically configured to determine, based on the first data feature Zl-1 and the category distribution information Πi of the training data, a regularized autocorrelation matrix of data whose category label corresponds to each category in the training data; and determine the network parameters of the lth layer based on the regularized autocorrelation matrix of the data whose category label corresponds to each category in the training data.
In another possible implementation,
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data,
αi is a weight parameter used to balance quantities of samples of the categories in the training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, and Uil are network parameters of the ith category of the lth layer.
In another possible implementation,
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, I is an identity matrix, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, S is autocorrelation matrixes of data whose category labels correspond to all the categories in the training data, Ŝ is regularized autocorrelation matrixes of the data whose category labels correspond to all the categories in the training data, and Ail are network parameters of the ith category of the lth layer.
In another possible implementation, the first determining unit is specifically configured to determine gradient parameters based on the category distribution information Πi of the training data; and determine the network parameters of the lth layer based on the first data feature Zl-1 and the gradient parameters.
In another possible implementation,
where
Zl-1 satisfies an energy constraint: Tr(Zl-1(Zl-1)T)=m(1+σ²d), σ is a Gaussian distribution variance, m is a quantity of samples of the training data, d is a dimension of the training data, Zl-1 is the first data feature, e∈Rm×1 is a column vector whose elements are all 1, Πi is the category distribution information of the training data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data whose category label corresponds to an ith category in the m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Cil are network parameters of the ith category of the lth layer, and G and Hi are the gradient parameters.
In another possible implementation, the second determining unit is specifically configured to determine, based on the to-be-processed data and the network parameters of the lth layer, category distribution information Πil that corresponds to predicted category labels and that is of the to-be-processed data; determine an objective function gradient expression based on the to-be-processed data and the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determine the data feature of the to-be-processed data based on the to-be-processed data and the objective function gradient expression.
In another possible implementation, the second determining unit is specifically configured to determine projections of the predicted category labels in the to-be-processed data on a first category based on the to-be-processed data and the network parameters of the lth layer, where the first category is any one of a plurality of categories corresponding to the predicted category labels in the to-be-processed data; and determine, based on the projections of the predicted category labels in the to-be-processed data on the first category, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data.
In another possible implementation,
pil=UilZ; and
where
Z is the to-be-processed data, Uil are network parameters of an ith category of the lth layer, pil represents projections of the predicted category labels in the to-be-processed data on the ith category of the lth layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
In another possible implementation, the objective function gradient expression includes:
where
mi is a quantity of pieces of data whose predicted category label is the ith category in m pieces of to-be-processed data, m=ΣiK mi, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data,
αi is a weight parameter used to balance quantities of samples of the predicted categories in the to-be-processed data, Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Si is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, and Ŝi is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data.
In another possible implementation,
pil=AilZ; and
Z is the to-be-processed data, Ail are network parameters of an ith category of the lth layer, pil represents projections of the predicted category labels in the to-be-processed data on the ith category of the lth layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
In another possible implementation, the objective function gradient expression includes:
where
mi is a quantity of pieces of data whose predicted category label is an ith category in m pieces of to-be-processed data,
αi is a weight parameter used to balance quantities of samples of predicted categories in the to-be-processed data, Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Si is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, Ŝi is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, S is autocorrelation matrixes of data whose predicted category labels correspond to all the categories in the to-be-processed data, and Ŝ is regularized autocorrelation matrixes of the data whose predicted category labels correspond to all the categories in the to-be-processed data.
In another possible implementation, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data includes one or more of the following: distance information, correlation information, differential information, or soft classification information.
In another possible implementation, Πil=argmin dist (Z, Cil); or
Πil=argmin <Z, Cil>; or
where
Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Cil are network parameters of an ith category of the lth layer, Zl is a data feature of the to-be-processed data at the lth layer, Zl-1 is a data feature of the to-be-processed data at the (l-1)th layer, and <> represents an inner product.
In another possible implementation, the second determining unit is specifically configured to determine gradient parameters (G and Hi) based on the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determine the objective function gradient expression based on the to-be-processed data and the gradient parameters.
In another possible implementation,
G=[g1, g2, . . . , gi]; and
where
Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data of an ith category in m pieces of to-be-processed data, m=ΣiK mi, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data, and G and Hi represent the gradient parameters.
In another possible implementation, the objective function gradient expression includes:
where
Z is the to-be-processed data, σ is a Gaussian distribution variance, ϵ is a regularization parameter, I is an identity matrix, G and Hi represent the gradient parameters, and β represents a regularization parameter.
In another possible implementation,
where
Zl is the data feature of the to-be-processed data, ∂L/∂Z is the objective function gradient expression, Zl-1 is the to-be-processed data, and Zl-1 is constrained in (d-1)-dimensional unit sphere space.
In another possible implementation, the data processing apparatus further includes an output unit. The output unit is configured to output the data feature of the to-be-processed data.
In another possible implementation, the to-be-processed data of the unknown classification or clustering information is a data feature of third data, the data feature of the third data is determined through another feedforward neural network, input information of an lth layer in the another feedforward neural network includes category distribution information of training data and a first data feature, output information of the lth layer includes a second data feature, the first data feature is an output of an (l-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and l is a positive integer greater than 1.
For technical effects brought by the second aspect or the possible implementations, refer to the descriptions of the technical effects brought by the first aspect or the corresponding implementations.
A third aspect of embodiments of this application discloses a data processing apparatus, including at least one processor and a communication interface, where the at least one processor invokes a computer program or instructions stored in a memory, to implement the method according to any one of the foregoing aspects.
A fourth aspect of embodiments of this application discloses a chip system, including at least one processor and a communication interface, where the at least one processor is configured to execute a computer program or instructions, to implement the method according to any one of the foregoing aspects.
A fifth aspect of embodiments of this application discloses a computer-readable storage medium, where the computer-readable storage medium stores computer instructions, and when the computer instructions are run on a processor, the method according to any one of the foregoing aspects is implemented.
A sixth aspect of embodiments of this application discloses a computer program product, where the computer program product includes computer program code, and when the computer program code is run on a computer, the method according to any one of the foregoing aspects is implemented.
A seventh aspect of embodiments of this application discloses a data processing system, including the apparatus according to the second aspect.
The following describes accompanying drawings used in embodiments of this application.
The following describes embodiments of this application with reference to the accompanying drawings in embodiments of this application.
In the specification and accompanying drawings of this application, the terms “first”, “second”, and the like are intended to distinguish between different objects or distinguish between different processing of a same object, but are not used to describe a particular order of the objects. In addition, the terms “including” and “having” and any variations thereof in descriptions of this application are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes other unlisted steps or units, or optionally further includes other inherent steps or units of the process, the method, the product, or the device. It should be noted that in embodiments of this application, the word “an example”, “for example”, or the like is used to represent giving an example, an illustration, or a description. Any embodiment or design scheme described as “example” or “for example” in embodiments of this application should not be explained as being more preferred or having more advantages than another embodiment or design scheme. Specifically, the words such as “example” or “for example” are used to present related concepts in a specific manner. In embodiments of this application, “A and/or B” represents two meanings: A and B, and A or B. “A, and/or B, and/or C” represents any one of A, B, and C, or represents any two of A, B, and C, or represents A, B, and C. The following describes technical solutions of this application with reference to accompanying drawings.
In recent years, the development of deep learning has led to research on a deep learning-based wireless communication technology in academia and industry. A research result proves that a deep learning technology can improve performance of a wireless communication system, and has potential to be applied to a physical layer to perform interference adjustment, channel estimation, signal detection, signal processing, and other aspects.
A conventional communication transceiver design may be replaced by an autoencoder, a transmit end and a receive end are modeled using a neural network, data distribution is learned by using a large quantity of training samples, and a result is predicted.
To resolve the problem existing in the BP algorithm, a random feature-based neural network and a metric representation-based neural network may further be used. The random feature-based neural network may be an extreme learning machine (extreme learning machine, ELM), and the ELM is a typical learning algorithm of a feedforward neural network. The network usually has one or more hidden layers, and parameters of the hidden layers do not need to be adjusted. Weights from a hidden layer to an output layer need to be determined only by solving one system of linear equations. Therefore, a calculation speed can be improved. Generalization performance of the algorithm is good, and a learning speed of the algorithm is 1000 times faster than that of training using the BP algorithm. However, a wide hidden layer is usually needed to obtain a sufficient quantity of features for representing original data. The metric representation-based neural network may be a neural network training method based on the Hilbert-Schmidt independence criterion (the Hilbert-Schmidt independence criterion, HSIC). The method is trained by using an approach that approximates the information bottleneck: mutual information between a hidden layer and a label needs to be maximized, and dependency between the hidden-layer representation and the input needs to be minimized. However, calculation of mutual information between random variables is difficult. Therefore, the HSIC, which is based on a non-parametric kernel method, is used, and this is more complex than the BP algorithm.
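As a brief illustration of the ELM idea mentioned above, hidden-layer weights are drawn at random and never adjusted, and only the output weights are obtained by solving a linear least-squares system. The following is a generic sketch, not code from this application:

```python
import numpy as np

def elm_train(X, Y, hidden_width=512, seed=0):
    """Minimal extreme learning machine (ELM) sketch.

    X: m x d input samples, Y: m x K one-hot labels.
    The hidden-layer weights W and biases b are random and fixed; only the
    output weights beta are determined, by solving H @ beta ≈ Y with the
    pseudo-inverse (a single linear system, no iterative training).
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], hidden_width))
    b = rng.standard_normal(hidden_width)
    H = np.tanh(X @ W + b)           # random hidden-layer features
    beta = np.linalg.pinv(H) @ Y     # output weights via least squares
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```

A wide hidden layer (a large hidden_width) is typically needed, which matches the drawback noted above.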
Therefore, to resolve the foregoing problem, embodiments of this application provide a data processing method, and provide a feedforward neural network model, to reduce communication overheads between a transmit end and a receive end caused by BP algorithm training and interaction, and improve training efficiency. In addition, in a scenario of dealing with different transmission/receiving network structures, a quantity of network layers is adjusted to improve training accuracy, to avoid a problem that retraining is needed due to different adaptations to different transmission/receiving networks.
First, an overall working process of an artificial intelligence system is described.
The infrastructure provides computing capability support for the artificial intelligence system, implements communication with the external world, and implements support by using a basic platform. A sensor is used to communicate with the outside. A computing capability is provided by an intelligent chip (a hardware acceleration chip like a CPU, an NPU, a GPU, an ASIC, or an FPGA). The basic platform includes related platforms such as a distributed computing framework and a network for assurance and support, and may include cloud storage and computing, an interconnection network, and the like. For example, the sensor communicates with the outside to obtain data, and the data is provided to the intelligent chip in a distributed computing system provided by the basic platform for computing.
Data at an upper layer of the infrastructure indicates a data source in the artificial intelligence field. The data relates to a graph, an image, a voice, and a text, further relates to Internet of things data of a conventional device, and includes service data of an existing system and perception data such as force, displacement, a liquid level, a temperature, and humidity.
Data processing usually includes data training, machine learning, deep learning, searching, inference, decision-making, and the like.
Machine learning and deep learning may perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Inference is a process in which human intelligent inference is simulated in a computer or an intelligent system, and machine thinking and problem resolving are performed by using formal information according to an inference control policy. Typical functions are searching and matching.
Decision-making is a process of making a decision after intelligent information is inferred, and usually provides functions such as classification, ranking, and prediction.
After data processing mentioned above is performed on the data, some general capabilities may be further formed based on a data processing result. For example, the general capabilities may be an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, and image recognition.
The intelligent product and industry application are products and applications of the artificial intelligence system in various fields. The intelligent product and industry application involve packaging overall artificial intelligence solutions, to productize and apply intelligent information decision-making. Application fields of the intelligent information decision-making mainly include: an intelligent terminal, intelligent transportation, intelligent healthcare, autonomous driving, a safe city, and the like.
Embodiments of this application are mainly applied to fields such as driver assistance, autonomous driving, and a mobile phone terminal.
The following describes several application scenarios:
In an advanced driver assistance system (ADAS) and an autonomous driving system (ADS), a plurality of types of 2D targets need to be detected in real time, and include a dynamic obstacle (a pedestrian (Pedestrian), a cyclist (Cyclist), a tricycle (Tricycle), a car (Car), a truck (Truck), or a bus (Bus)), a static obstacle (a traffic cone (TrafficCone), a traffic stick (TrafficStick), a fire hydrant (FireHydrant), a motorcycle (Motorcycle), or a bicycle (Bicycle)), or a traffic sign (TrafficSign) (a guide sign (GuideSign), a billboard (Billboard), a red traffic light (TrafficLight_Red)/yellow traffic light (TrafficLight_Yellow)/green traffic light (TrafficLight_Green)/black traffic light (TrafficLight_Black), or a road sign (RoadSign)). In addition, to accurately obtain an area occupied by the dynamic obstacle in 3D space, 3D estimation further needs to be performed on the dynamic obstacle, to output a 3D box. To integrate with data of laser radar, a mask of the dynamic obstacle needs to be obtained to filter out laser point clouds that hit the dynamic obstacle. To accurately locate a parking space, four keypoints of the parking space need to be detected at the same time. To locate a composition, key points of static objects need to be detected.
This is a semantic segmentation issue. A camera of an autonomous driving vehicle captures a road picture, and the picture needs to be segmented into different objects such as a road surface, a roadbed, a vehicle, and a pedestrian, to keep the vehicle driving in a correct area. For autonomous driving, which has an extremely high requirement on safety, a picture needs to be understood in real time. Therefore, a feedforward neural network that can run in real time and can perform semantic segmentation is critical.
After obtaining a to-be-classified image, an object recognition apparatus processes an object in the to-be-classified image through a classification model obtained through training based on the data processing method in embodiments of this application, to obtain a category of the object in the to-be-classified image, and then may classify the to-be-classified image based on the object category of the object in the to-be-classified image. A photographer takes many photos every day, such as photos of animals, photos of people, and photos of plants. According to the method in this application, the photos can be quickly classified based on content in the photos, and may be classified into photos including animals, photos including people, and photos including plants.
When there are a large quantity of images, efficiency of a manual classification manner is low, and a person is prone to fatigue when processing a same thing for a long time. In this case, a classification result has a large error.
After obtaining an image of a commodity, the object recognition apparatus processes the image of the commodity by using the classification model obtained through training based on the data processing method in embodiments of this application, to obtain a category of the commodity in the image of the commodity, and then classifies the commodity based on the category of the commodity. For a variety of commodities in a large shopping mall or a supermarket, the commodities can be quickly classified by using the method in this application, to reduce time overheads and labor costs.
This is an image similarity comparison issue. When a passenger performs face authentication at an entrance gate of a high-speed railway station or an airport, a camera captures a face image. The method in embodiments of this application is used to extract a feature, and calculate a similarity between the extracted feature and an image feature of an identification card stored in a system. If the similarity is high, the authentication succeeds. Face verification can be quickly performed by using the method in this application.
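As a simple illustration of the comparison step in this scenario (the feature extraction itself is performed by the trained model), the similarity may, for example, be a cosine similarity against the stored identification-card feature; the threshold below is only a placeholder value:

```python
import numpy as np

def face_verify(live_feature, stored_feature, threshold=0.8):
    """Return True when the cosine similarity between the captured face feature
    and the stored ID-card feature exceeds the threshold."""
    a = live_feature / np.linalg.norm(live_feature)
    b = stored_feature / np.linalg.norm(stored_feature)
    return float(a @ b) >= threshold
```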
This is a speech recognition and machine translation issue. In terms of the speech recognition and machine translation issue, a feedforward neural network is also a common recognition model. In a scenario in which simultaneous interpretation is needed, real-time speech recognition and interpretation need to be implemented. An efficient feedforward neural network can provide better experience for a translation machine.
A feedforward neural network model trained in embodiments of this application may implement the foregoing functions.
The following describes a system architecture provided in embodiments of this application.
Specifically, the trained feedforward neural network model can be used to implement the data processing method provided in embodiments of this application.
It should be noted that, in actual application, the training data maintained in the database 130 is not necessarily all collected by the data collection device 160, and may be partially received from another device. In addition, it should be noted that the training device 120 does not necessarily train the feedforward neural network model completely based on the training data maintained in the database 130, and may perform model training by using training data obtained from a cloud or another place. The foregoing descriptions should not be construed as a limitation on embodiments of this application.
A target model/rule 101 obtained through training by the training device 120 may be applied to different systems or devices, for example, applied to an execution device 110 shown in
In a process in which the execution device 110 preprocesses the input data, or in a process in which a computation module 111 of the execution device 110 performs related processing like computation (for example, performs function implementation of the feedforward neural network in this application), the execution device 110 may invoke data, code, and the like in a data storage system 170 for corresponding processing, and may further store, in the data storage system 170, data, instructions, and the like that are obtained through the corresponding processing.
Finally, the I/O interface 112 returns a processing result such as an image, video, or voice recognition result or classification result to the client device 140, so that the client device 140 can provide the result to a user device 150. The user device 150 may be a lightweight terminal that needs to use the target model/rule 101, for example, a mobile phone terminal, a notebook computer, an AR/VR terminal, or a vehicle-mounted terminal, to respond to a corresponding requirement of a terminal user, for example, perform image recognition on an image input by the terminal user and output a recognition result to the terminal user, or classify a text input by the terminal user and output a classification result to the terminal user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data. The corresponding target models/rules 101 may be used to achieve the foregoing targets or complete the foregoing tasks, to provide a needed result for the user.
In a case shown in
After receiving the output result, the client device 140 may transmit the result to the user device 150. The user device 150 may be a terminal, for example, a mobile phone terminal, a tablet computer, a notebook computer, AR/VR, or a vehicle-mounted terminal. In an example, the user device 150 may run the target model/rule 101 to implement a specific function.
It should be noted that
As shown in
The following describes a diagram of a structure of a feedforward neural network according to an embodiment of this application.
As shown in
A specific computation process of the 1st layer is shown in
A specific computation process of the 2nd layer is shown in
A specific computation process of the 3rd layer is shown in
Then, the network parameters of each layer, for example, Ui1, Ui2, and Ui3, are stored as d×d fully connected layer parameters, to obtain a trained feedforward neural network model. The specific deduction process is as follows:
A specific process of obtaining to-be-processed data of unknown classification or clustering information; and inputting the to-be-processed data into the trained feedforward neural network model to obtain a data feature of the to-be-processed data is as follows.
A specific computation process of the 1st layer is shown in
A specific computation process of the 2nd layer is shown in
A specific computation process of the 3rd layer is shown in
The following describes a hardware structure of a chip provided in embodiments of this application.
The artificial intelligence processor 50 may be any processor suitable for large-scale exclusive OR operation processing, for example, a neural network processing unit (network processing unit, NPU), a tensor processing unit (tensor processing unit, TPU), or a graphics processing unit (graphics processing unit, GPU). The NPU is used as an example. The NPU may be mounted, as a coprocessor, onto a host CPU (Host CPU), and the host CPU allocates a task to the NPU. A core part of the NPU is an operation circuit 503, and a controller 504 controls the operation circuit 503 to extract data in a memory (a weight memory or an input memory) and perform an operation.
In some implementations, the operation circuit 503 includes a plurality of processing units (processing engines, PEs). In some implementations, the operation circuit 503 is a two-dimensional systolic array. The operation circuit 503 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 503 is a general purpose matrix processor.
For example, it is assumed that there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 503 obtains data corresponding to the matrix B from the weight memory 502, and buffers the data on each PE in the operation circuit 503. The operation circuit 503 obtains input data of the matrix A from the input memory 501, performs a matrix operation on the input data of the matrix A and the weight data of the matrix B, and stores an obtained partial result or final result of the matrices in an accumulator (accumulator) 508.
A unified memory 506 is configured to store input data and output data. The weight data is transferred to the weight memory 502 through a direct memory access controller (direct memory access controller, DMAC) 505. The input data is also transferred to the unified memory 506 through the DMAC.
A bus interface unit (bus interface unit, BIU) 510 is configured to implement interaction between the DMAC and an instruction fetch buffer (instruction fetch buffer) 509. The bus interface unit 510 is further used by the instruction fetch buffer 509 to obtain instructions from an external memory. The bus interface unit 510 is further used by the direct memory access controller 505 to obtain original data of the input matrix A or original data of the weight matrix B from the external memory.
The DMAC is mainly configured to transfer input data in the external memory DDR to the unified memory 506, or transfer weight data to the weight memory 502, or transfer input data to the input memory 501.
A vector calculation unit 507 may include a plurality of operation processing units. If needed, further processing is performed on an output of the operation circuit 503, for example, vector multiplication, vector addition, an exponential operation, a logarithmic operation, or size comparison. The vector calculation unit 507 is mainly configured to perform intermediate-layer calculation in the feedforward neural network.
In some implementations, the vector calculation unit 507 stores a processed output vector in the unified memory 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the operation circuit 503, for example, a vector of an accumulated value, to generate an activation value. In some implementations, the vector calculation unit 507 generates a normalized value, a combined value, or both a normalized value and a combined value. In some implementations, the processed output vector can be used as an activation input to the operation circuit 503, for example, used at a subsequent layer in the feedforward neural network.
The instruction fetch buffer (instruction fetch buffer) 509 connected to the controller 504 is configured to store instructions used by the controller 504.
The controller 504 is configured to invoke the instructions buffered in the instruction fetch buffer 509, to control a working process of the operation accelerator.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip (On-Chip) memories. The external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM) or another readable and writable memory.
The execution device 110 in
Embodiments of this application provide a system architecture. The system architecture includes one or more local devices, an execution device, and a data storage system. The local device is connected to the execution device through a communication network.
The execution device may be implemented by one or more servers. Optionally, the execution device may cooperate with another computation device, for example, a device such as a data memory, a router, or a load balancer. The execution device may be deployed on one physical site, or distributed on a plurality of physical sites. The execution device may implement the data processing method in embodiments of this application by using data in the data storage system or by invoking program code in the data storage system.
A user may operate a respective user device (for example, one or more local devices) to interact with the execution device. Each local device may represent any computation device, for example, a personal computer, a computer workstation, a smartphone, a tablet computer, an intelligent camera, a smart automobile, another type of cellular phone, a media consumption device, a wearable device, a set-top box, or a game console.
The local device of each user may interact with the execution device through a communication network of any communication mechanism/communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof.
In an implementation, the local device obtains a related parameter of a target neural network from the execution device, deploys the target neural network on the local device, and performs image classification, image processing, or the like through the target neural network. The target neural network is obtained through training according to the data processing method in embodiments of this application.
In another implementation, the target neural network may be directly deployed on the execution device. The execution device obtains to-be-processed data from the local device, and performs classification or another type of processing on the to-be-processed data based on the target neural network.
The execution device may also be referred to as a cloud device. In this case, the execution device is usually deployed on a cloud.
The following describes some terms in this application for ease of understanding.
It is assumed that there are m pieces of sampled data, and a dimension of each piece of sampled data is d. In this case, the sampled data Z=[X1, X2, . . . , Xm] ∈Rd×m, and an autocorrelation matrix S of the sampled data Z may be used as an important parameter for representing distribution of the sampled data. A calculation formula of the autocorrelation matrix S of the sampled data Z is specifically as follows:
S=(1/m)ZZ^T, where
S is the autocorrelation matrix of the sampled data, m is a quantity of pieces of sampled data, and Z is the sampled data.
For Z, the autocorrelation matrix S is an unbiased estimation and is a positive definite matrix. Similarly, a type of autocorrelation matrix may be defined as:
Si=(1/mi)ZΠiZ^T, where
Si is an autocorrelation matrix of data whose category label corresponds to an ith category in the sampled data, mi represents a quantity of pieces of data whose category label corresponds to the ith category in the sampled data and therefore m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of sampled data, Πi is category distribution information of the data whose category label corresponds to the ith category in the sampled data, and Z is the sampled data.
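For illustration, the two definitions above can be computed directly from a feature matrix and category labels. The following is a minimal NumPy sketch under these definitions; the function and variable names (for example, labels and num_classes) are illustrative assumptions and not part of this application.

    import numpy as np

    def autocorrelation(Z):
        # S = (1/m) Z Z^T, where Z is d x m and each column is one piece of sampled data
        m = Z.shape[1]
        return Z @ Z.T / m

    def per_category_autocorrelation(Z, labels, num_classes):
        # S_i = (1/m_i) Z Pi_i Z^T, where Pi_i is the diagonal category-membership matrix
        d, m = Z.shape
        result = []
        for i in range(num_classes):
            mask = (labels == i).astype(Z.dtype)      # diagonal of Pi_i (0/1 entries)
            m_i = mask.sum()
            Zi = Z * mask                             # equals Z @ diag(mask) column-wise
            result.append(Zi @ Zi.T / m_i if m_i > 0 else np.zeros((d, d)))
        return result

    # toy usage: 100 samples of dimension 8, 10 categories with 10 samples each
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(8, 100))
    labels = np.repeat(np.arange(10), 10)
    S = autocorrelation(Z)
    S_per_class = per_category_autocorrelation(Z, labels, 10)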
If autocorrelation matrixes of two random variables comply with a high-dimensional normal distribution, a KL (Kullback-Leibler) divergence between the two matrixes may be defined as:
DKL(Si||Sj)=(1/2)[Tr(Sj^(-1)Si)-d+logdet(Sj)-logdet(Si)], where
DKL(Si||Sj) is a KL divergence between the autocorrelation matrix of data whose category label corresponds to the ith category in the sampled data and an autocorrelation matrix of data whose category label corresponds to a jth category in the sampled data, Si is the autocorrelation matrix of the data whose category label corresponds to the ith category in the sampled data, Sj is the autocorrelation matrix of the data whose category label corresponds to the jth category in the sampled data, d is a dimension of the sampled data, Tr( ) represents a trace operation, and logdet( ) represents a logarithm of a determinant of a matrix.
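For illustration, the KL divergence above can be evaluated numerically once Si and Sj are available. The following sketch assumes the standard closed form for zero-mean high-dimensional normal distributions implied by the Tr( ) and logdet( ) terms, with a small regularization added so that the inverse and determinant exist; it is an illustrative sketch rather than the exact formula of this application.

    import numpy as np

    def kl_divergence(Si, Sj, eps=1e-6):
        # D_KL(Si || Sj) = 1/2 [ Tr(Sj^{-1} Si) - d + logdet(Sj) - logdet(Si) ]
        d = Si.shape[0]
        Si = Si + eps * np.eye(d)   # light regularization (illustrative choice)
        Sj = Sj + eps * np.eye(d)
        term_tr = np.trace(np.linalg.solve(Sj, Si))
        _, logdet_i = np.linalg.slogdet(Si)
        _, logdet_j = np.linalg.slogdet(Sj)
        return 0.5 * (term_tr - d + logdet_j - logdet_i)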
Because the KL divergence is asymmetric, to meet symmetry of a distance measurement, the JS (Jensen-Shannon) divergence may be further used. In this case, the JS divergence between two matrixes may be defined as follows:
DJS(Si||Sj) is a JS divergence between the autocorrelation matrix of data whose category label corresponds to the ith category in the sampled data and the autocorrelation matrix of data whose category label corresponds to the jth category in the sampled data, Si is the autocorrelation matrix of the data whose category label corresponds to the ith category in the sampled data, Sj is the autocorrelation matrix of the data whose category label corresponds to the jth category in the sampled data, d is a dimension of the sampled data, and Tr( ) represents a trace operation.
Therefore, an objective function may be determined to perform an operation to expand a JS divergence between autocorrelation matrixes of sampled data of different categories, so as to distinguish between the sampled data of the different categories, so that a classification/clustering effect is achieved. Specifically, an expression of the objective function is as follows:
αi,j is a weight parameter used to balance quantities of pieces of sampled data of categories, mi represents a quantity of pieces of data whose category label corresponds to an ith category in the sampled data, mj represents a quantity of pieces of data whose category label corresponds to a jth category in the sampled data, and DJS(Si||Sj) is a JS divergence between an autocorrelation matrix of the data whose category label corresponds to the ith category in the sampled data and an autocorrelation matrix of the data whose category label corresponds to the jth category in the sampled data.
The objective function may be used for network update. To implement a feedforward neural network, a data feature Z may be updated in a gradient ascending manner. Details are as follows.
Zl=Zl-1+λ·∂L/∂Z, where Zl represents a data feature of an 1th layer in the feedforward neural network, Zl-1 represents a data feature of an (1-1)th layer in the feedforward neural network, ∂L/∂Z represents a gradient expression of the objective function, and λ represents a step or a learning rate.
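For illustration, the gradient-ascent update of the data feature can be written as a short layer-by-layer loop. The following sketch passes the gradient expression in as a callable, because the concrete expression depends on which objective function is used; the names grad_fn and num_layers are illustrative assumptions.

    def feedforward_update(Z0, grad_fn, lam=0.001, num_layers=6):
        # Z^l = Z^{l-1} + lam * dL/dZ evaluated at Z^{l-1}; one update per network layer
        Z = Z0
        features = [Z]
        for _ in range(num_layers):
            Z = Z + lam * grad_fn(Z)
            features.append(Z)
        return features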
An objective function gradient expression may be determined based on the objective function. Details are as follows.
αi,j is a weight parameter used to balance quantities of pieces of sampled data of categories, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the sampled data, Ŝi=ϵI+Si, Si is the autocorrelation matrix of the data whose category label corresponds to the ith category in the sampled data, ϵ is a regularization parameter, I is an identity matrix, mi represents a quantity of pieces of the data whose category label corresponds to the ith category in the sampled data and therefore m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of sampled data, Πi is category distribution information of the data whose category label corresponds to the ith category in the sampled data, Ŝj is a regularized autocorrelation matrix of the data whose category label corresponds to the jth category in the sampled data, Ŝj=ϵI+Sj, Sj is the autocorrelation matrix of the data whose category label corresponds to the jth category in the sampled data, mj represents a quantity of pieces of the data whose category label corresponds to the jth category in the sampled data, Πj is category distribution information of the data whose category label corresponds to the jth category in the sampled data, and Z is the sampled data.
When original data complies with a probability distribution, a corresponding feature Z of the original data also complies with a probability distribution, which is expressed as:
P(Z) is a mixture distribution generated by a group of conditional probabilities {P(Z|Z ∈Ck)}, and Ck is category information. When the category information is not given, a random vector Z complies with the distribution P(Z). When the category information Ck is given, the random vector Z complies with the distribution P(Z|Z∈Ck). Therefore, it is expected that the introduction of the category information can bring a large change in the feature distribution. A difference between the distribution P(Z|Z∈Ck) and the distribution P(Z) is used as a measure of the feature. Specifically, an expression of the objective function is as follows:
αk is a weight parameter used to balance quantities of pieces of sampled data of categories, mk represents a quantity of pieces of data whose category label corresponds to a kth category in the sampled data, Sk is an autocorrelation matrix of a feature Zk complying with the conditional probability distribution P(Z|Z∈Ck), S is an autocorrelation matrix of a feature Z complying with the probability distribution P(Z), and DKL(Sk∥S) is a KL divergence between Sk and S.
An objective function gradient expression may be determined based on the objective function. Details are as follows.
αk is a weight parameter used to balance quantities of pieces of sampled data of categories, mk represents a quantity of pieces of data whose category label corresponds to a kth category in the sampled data, and therefore m=ΣkK mk, K is a quantity of all categories of category labels in m pieces of sampled data, Ŝk is a regularized autocorrelation matrix of the data whose category label corresponds to the kth category in the sampled data, Ŝk=ϵI+Sk, ϵ is a regularization parameter, Sk is an autocorrelation matrix of the data whose category label corresponds to the kth category in the sampled data, I is an identity matrix, Πk is category distribution information of the data whose category label corresponds to the kth category in the sampled data, Ŝ is regularized autocorrelation matrixes of data whose category labels correspond to all the categories in the sampled data, S is autocorrelation matrixes of the data whose category labels correspond to all the categories in the sampled data, and Z is the sampled data.
Feature extraction may be considered as a process of searching for a mapping from original data space to feature space. Contrastive learning (contrastive learning) is a feature extraction method, and a core idea of the contrastive learning is that the distance, in the feature space, between images of similar original data should be as small as possible, and the distance between images of original data that differs greatly should be as large as possible. Therefore, an objective function may be designed based on the idea of contrastive learning, and the following two principles are specifically followed: (1) Contrast: A distance between central nodes of data classification/clustering should be as large as possible. (2) Diversity: Data should be as diverse as possible in the same classification/clustering.
Details are as follows: For n pieces of data for classification/clustering, according to the contrast principle, if a distance between every two nodes is directly calculated, a calculation amount is O(n2). This is a multi-objective optimization problem, which is difficult to be processed. Therefore, the contrast principle is equivalently described as: maximizing a volume of an n-dimensional simplex spanned from each node under a condition that data energy is fixed. The diversity principle may be described by using entropy, and the diversity principle is described as maximizing entropy of a feature under a condition that classification/clustering information is known. It can be proved that, under a condition that feature energy is fixed, the feature has maximum entropy only when a feature distribution is white Gaussian noise. Therefore, it is expected that the feature distribution is as close as possible to a Gaussian distribution. Similar to the foregoing descriptions, a KL divergence may be used to describe a similarity between the feature distribution and the Gaussian distribution, and an objective function is defined as:
where
a volume of a K-dimensional simplex spanned from a central node is
e∈Rm×1 is a column vector whose elements are all 1, Πk is category distribution information of data whose category label corresponds to a kth category in sampled data, Tr( ) represents a trace operation, Z needs to satisfy an energy constraint Tr(ZZT)=m(1+σ2d), σ is a Gaussian distribution variance, m represents a quantity of pieces of the sampled data, and d is a dimension of the sampled data.
The objective function satisfies convexity and unitary invariance. Therefore, a gradient expression of the objective function is specifically as follows:
e∈Rm×1 is the column vector whose elements are all 1, Πk is the category distribution information of the data whose category label corresponds to the kth category in the sampled data, Tr( ) represents the trace operation, Z needs to satisfy the energy constraint Tr(ZZT)=m(1+σ2d), σ is the Gaussian distribution variance, m represents the quantity of pieces of the sampled data, mk is a quantity of pieces of data whose category label corresponds to the kth category in the m pieces of sampled data, m=ΣkK mk, K is a quantity of all categories of category labels in the m pieces of sampled data, I is an identity matrix, β represents a regularization parameter, and Z is the sampled data.
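For illustration, the energy constraint Tr(ZZT)=m(1+σ2d) mentioned above can be enforced by a simple rescaling of the feature matrix. The following sketch is one way to project onto that constraint; it is an illustrative assumption about how the constraint is maintained, not a statement of the exact projection used in this application.

    import numpy as np

    def project_to_energy_constraint(Z, sigma):
        # Rescale Z so that Tr(Z Z^T) = m * (1 + sigma^2 * d)
        d, m = Z.shape
        target = m * (1.0 + sigma ** 2 * d)
        current = np.trace(Z @ Z.T)
        return Z * np.sqrt(target / current)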
The following describes in detail a method in embodiments of this application.
Specifically, the training data includes category labels. In an example, it is assumed that there are m=100 pictures in the training data, pictures 1 to 10 are category 1, that is, a number “0” category, pictures 11 to 20 are category 2, that is, a number “1” category, pictures 21 to 30 are category 3, that is, a number “2” category, . . . , and pictures 91 to 100 are category 10, that is, a number “9” category.
In a possible implementation, category distribution information of the training data may be determined based on the category labels in the training data.
Specifically, a general classification or clustering task may have m pieces of d-dimensional data, which are represented as a feature matrix Z∈Rd×m and have K classifications/clusters, that is, C1, . . . , CK. When soft classification/clustering is considered, a specific definition may be as follows:
It can be learned that Πk is a diagonal matrix having values only on a diagonal line, and ΣkΠk=Im×m. Πk indicates distribution information of each category in data, and distribution information in a training set should be the same as distribution information in a test set. Therefore, category distribution information of original data may be obtained by estimating a parameter Πk, and then feature extraction is performed on the data by using the category distribution information of the original data.
In an example, an MNIST dataset is used as an example. It is assumed that m=100 pieces of picture data are sampled from the dataset, each picture includes d=28*28-dimensional pixels whose values are within [0, 1], and the 100 pictures are training data. In this case, Z represents a feature matrix including such a group of training data. There are K=10 categories. It is assumed that pictures 1 to 10 are a number "0" category, and pictures 11 to 20 are a number "1" category. Therefore, it may be determined that for the number "0" category, the first 10 diagonal elements of Π0 are 1, and the rest are 0. In this case, there is distribution information Π0=diag(1,1,1,1,1,1,1,1,1,1,0, . . . ,0) of the number "0" category in the training data. The other categories are similar.
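For illustration, the category distribution information Πk in this example can be built directly from the category labels. The following NumPy sketch mirrors the example above (100 pictures, 10 categories, 10 pictures per category); the helper name is an illustrative assumption.

    import numpy as np

    def category_distribution_matrices(labels, num_classes):
        # Pi_k is an m x m diagonal matrix whose diagonal marks the samples of category k,
        # so that the sum of all Pi_k is the m x m identity matrix
        return [np.diag((labels == k).astype(float)) for k in range(num_classes)]

    labels = np.repeat(np.arange(10), 10)       # pictures 1-10 are "0", 11-20 are "1", ...
    Pi = category_distribution_matrices(labels, 10)
    assert np.allclose(sum(Pi), np.eye(100))    # Sum_k Pi_k = I_{m x m}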
Specifically, input information of an 1th layer in the feedforward neural network model includes category distribution information of the training data and a first data feature, output information of the 1th layer includes a second data feature, the first data feature is an output of an (1-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and 1 is a positive integer greater than 1.
In a possible implementation, when 1=2, and the first data feature is an output of a 1st layer, input information of the 1st layer includes the category distribution information of the training data and the training data. In other words, when 1=2, the input information of the 1st layer includes the category distribution information of the training data and the training data, the output of the 1st layer is the first data feature, input information of a 2nd layer includes the category distribution information of the training data and the first data feature, and output information of the 2nd layer includes the second data feature. An input dimension of an input dataset X may be reduced to a d dimension through feature engineering, to obtain training data as an input. In this embodiment of this application, the input dimension of the input dataset X is the same as a dimension of the training data.
In a possible implementation, the determining a feedforward neural network model includes: obtaining the first data feature Zl-1; and determining network parameters of the 1th layer based on the first data feature Zl-1 and the category distribution information Πi of the training data, where the second data feature is determined based on the first data feature Zl-1 and the network parameters of the 1th layer.
Specifically, the determining network parameters of the 1th layer based on the first data feature Zl-1 and the category distribution information Πi of the training data specifically includes the following manners: determining, based on the first data feature Zl-1 and the category distribution information Πi of the training data, a regularized autocorrelation matrix of data whose category label corresponds to each category in the training data; and determining the network parameters of the 1th layer based on the regularized autocorrelation matrix of the data whose category label corresponds to each category in the training data.
A specific formula is as follows:
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data,
αi is a weight parameter used to balance quantities of samples of the categories in the training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, and Uil is network parameters of the ith category of the 1th layer.
It is determined based on the foregoing formula that if a value of Uil is smaller, it indicates that distribution of the ith category in the training data is closer to that of another category. Therefore, Uil can be used as a discriminative parameter, and Ui of each layer may be stored as a d*d fully connected layer parameter through a network. Finally, a gradient expression of an objective function, for example, Formula (1), is calculated, and an operation of projection to Ps
Alternatively, a specific formula is as follows:
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, I is an identity matrix, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, S is autocorrelation matrixes of data whose category labels correspond to all the categories in the training data, Ŝ is regularized autocorrelation matrixes of the data whose category labels correspond to all the categories in the training data, and Ail is network parameters of the ith category of the 1th layer.
It is determined based on the foregoing formula that if a value of Ail is smaller, it indicates that distribution of the ith category in the training data is closer to that of another category. Therefore, Ail can be used as a discriminative parameter, and Ai of each layer may be stored as a d*d fully connected layer parameter through a network. Finally, a gradient expression of an objective function, for example, Formula (2), is calculated, and an operation of projection to Ps
Specifically, the determining network parameters of the 1th layer based on the first data feature Zl-1 and the category distribution information Πi of the training data may specifically alternatively include the following manners: determining gradient parameters based on the category distribution information Πi of the training data; and determining the network parameters of the 1th layer based on the first data feature Zl-1 and the gradient parameters.
A specific formula is as follows:
where
Zl-1 satisfies an energy constraint: Tr(Zl-1(Zl-1)T)=m(1+σ2d), σ is a Gaussian distribution variance, m is a quantity of samples of the training data, d is a dimension of the training data, Zl-1 is the first data feature, e∈Rm×1 is a column vector whose elements are all 1, Πi is the category distribution information of the training data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data whose category label corresponds to an ith category in the m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Cil represents network parameters of the ith category of the 1th layer, and G and Hi are the gradient parameters.
Cil is a clustering center, where a simplex vertex may be used as the clustering center and used to mark a reference between categories. If a value of Cil is smaller, it indicates that distribution of the ith category in the training data is closer to that of a jth category in the training data. Therefore, Cil can be used as a discriminative parameter, and intermediate variables Cil of each layer may be stored as a d×d fully connected layer parameter through a network. Finally, a gradient expression of an objective function, for example, Formula (3), is calculated, and an operation of projection to Ps
Specifically, after the network parameters of the 1th layer are determined, that the second data feature may be determined based on the first data feature Zl-1 and the network parameters of the 1th layer specifically includes the following manners: determining an objective function gradient expression based on the network parameters of the 1th layer and the first data feature Zl-1;
and then determining the second data feature Zl based on the first data feature Zl-1, the category distribution information Πi of the training data, and the objective function gradient expression. The objective function gradient expression may be described as Formula (1), Formula (2), or Formula (3). Details are not described herein again.
Zl=Zl-1+λ·∂L/∂Z, where Zl is the second data feature, λ represents a step or a learning rate, ∂L/∂Z is the objective function gradient expression, and Zl-1 is the first data feature.
To better describe a training process of the feedforward neural network model, descriptions are provided by using examples in which Formula (1), Formula (2), and Formula (3) are respectively used as the objective function gradient expression. Details are as follows.
In an example, an example in which the objective function gradient expression is Formula (1) is used. A computation process of an 1th layer in the feedforward neural network model is shown in
where
Zl-1 is the feature of the (1-1)th layer, λ represents a step or a learning rate, ∂L/∂Z is shown in Formula (1), and Zl is the feature of the 1th layer.
In an example, an example in which the objective function gradient expression is Formula (2) is used. A computation process of an 1th layer in the feedforward neural network model is shown in
where
Zl-1 is the feature of the (1-1)th layer, λ represents a step or a learning rate, ∂L/∂Z is shown in Formula (2), and Zl is the feature of the 1th layer.
In an example, an example in which the objective function gradient expression is Formula (3) is used. A computation process of an 1th layer in the feedforward neural network model is shown in
where
Zl-1 is the feature of the (1-1)th layer, λ represents a step or a learning rate, ∂L/∂Z is shown in Formula (3), Zl is the feature of the 1th layer, and Zl-1 is constrained in (d-1)-dimensional unit sphere space.
In the foregoing method, the feedforward neural network model is provided, so as to reduce communication overheads between a transmit end and a receive end caused by BP algorithm training and interaction, and improve training efficiency. In addition, in a scenario of dealing with different transmission/receiving network structures, a quantity of network layers can be adjusted to improve training accuracy, which avoids a problem that retraining is needed due to different adaptations to different transmission/receiving networks.
The following describes in detail a method in embodiments of this application.
Specifically, a process of determining the feedforward neural network model may be shown in
Step S1002: Obtain to-be-processed data of unknown classification or clustering information.
Optionally, the to-be-processed data of the unknown classification or clustering information does not include category labels.
Step S1003: Input the to-be-processed data into the feedforward neural network model to determine a data feature of the to-be-processed data.
Specifically, the data feature of the to-be-processed data is the classification or clustering information representing the to-be-processed data. The data feature of the to-be-processed data is used to determine a classification or clustering result of the to-be-processed data. A dimension of the data feature of the to-be-processed data is related to a data type of the to-be-processed data. For example, for selection of the dimension of the data feature, it can be learned from a VC (Vapnik-Chervonenkis) dimension theory that a higher VC dimension indicates higher model complexity and easier differentiation. However, overfitting is likely to occur if the dimension is excessively high. Therefore, a proper dimension needs to be determined. A general estimation manner of determining a dimension lower limit is to calculate an eigenvalue of an autocorrelation matrix of original data, remove some dimensions whose eigenvalues are close to 0, and use the remaining dimensions as dimensions for extracting features. In addition, the dimension may be refined for different data types. For example, if a data type of to-be-processed data is a picture, a dimension of a data feature of the to-be-processed data may be 1000. If the data type of the to-be-processed data is a text, a dimension of the data feature of the to-be-processed data may be 768.
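For illustration, the dimension lower bound described above (removing dimensions whose eigenvalues are close to 0) can be estimated as follows; the threshold value is an illustrative assumption.

    import numpy as np

    def estimate_feature_dimension(Z, tol=1e-3):
        # Eigenvalues of the autocorrelation matrix of the original data;
        # dimensions whose eigenvalues are close to 0 are removed from the count
        m = Z.shape[1]
        S = Z @ Z.T / m
        eigvals = np.linalg.eigvalsh(S)     # S is symmetric, so eigvalsh applies
        return int(np.sum(eigvals > tol))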
The process of inputting the to-be-processed data into the feedforward neural network model to determine the data feature of the to-be-processed data may be understood as a deduction process, which is specifically as follows.
In a possible implementation, the inputting the to-be-processed data into the feedforward neural network model to determine a data feature of the to-be-processed data specifically includes the following manners: determining, based on the to-be-processed data and the network parameters of the 1th layer, category distribution information Πil that corresponds to predicted category labels and that is of the to-be-processed data; determining an objective function gradient expression based on the to-be-processed data and the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determining the data feature of the to-be-processed data based on the to-be-processed data and the objective function gradient expression.
The determining, based on the to-be-processed data and the network parameters of the 1th layer, category distribution information Πil that corresponds to predicted category labels and that is of the to-be-processed data may specifically include the following manners: determining projections of the predicted category labels in the to-be-processed data on a first category based on the to-be-processed data and the network parameters of the 1th layer, where the first category is any one of a plurality of categories corresponding to the predicted category labels in the to-be-processed data; and determining, based on the projections of the predicted category labels in the to-be-processed data on the first category, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data.
A specific formula for determining the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data is as follows:
pil=UilZ; and
where
Z is the to-be-processed data, Uil is network parameters of an ith category of the 1th layer, pil is projections of the predicted category labels in the to-be-processed data on the ith category of the 1th layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence. pil=UilZ may be understood as the projections of the to-be-processed data on the ith category of the 1th layer. When a value of pil is smaller, it indicates a closer correlation with the ith category of the 1th layer. Therefore, the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data may be determined by using a softmax function. The specific formula is shown above.
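For illustration, the prediction step above can be sketched as follows. The sketch assumes that the per-sample score of category i is the column norm of pil=UilZ (the exact scalarization is not spelled out here) and that a smaller score is mapped to a larger soft-assignment weight through the factor -η in the softmax; both choices are illustrative assumptions.

    import numpy as np

    def predict_category_distribution(Z, U_list, eta=500.0):
        # p_i = U_i Z; a smaller score indicates a closer correlation with category i,
        # so the soft assignment applies a softmax over -eta * score for each sample
        scores = np.stack([np.linalg.norm(U @ Z, axis=0) for U in U_list])  # shape (K, m)
        logits = -eta * scores
        logits -= logits.max(axis=0, keepdims=True)     # numerical stability
        weights = np.exp(logits)
        weights /= weights.sum(axis=0, keepdims=True)   # columns sum to 1 over categories
        return [np.diag(w) for w in weights]            # Pi_i^l as m x m diagonal matrices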
After the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data is determined based on the foregoing formula, the objective function gradient expression is determined based on the to-be-processed data and the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data, where the objective function gradient expression may be specifically shown in Formula (1). Then, the data feature of the to-be-processed data is determined based on the to-be-processed data and the objective function gradient expression.
Alternatively, a specific formula for determining the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data is as follows:
pil=AilZ; and
where
Z is the to-be-processed data, Ail is network parameters of an ith category of the 1th layer, pil is projections of the predicted category labels in the to-be-processed data on the ith category of the 1th layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence. pil=AilZ may be understood as the projections of the to-be-processed data on the ith category of the 1th layer. When a value of pil is smaller, it indicates a closer correlation with the ith category of the 1th layer. Therefore, the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data may be determined by using a softmax function. The specific formula is shown above.
After the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data is determined based on the foregoing formula, the objective function gradient expression is determined based on the to-be-processed data and the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data, where the objective function gradient expression may be specifically shown in Formula (2). Then, the data feature of the to-be-processed data is determined based on the to-be-processed data and the objective function gradient expression.
Alternatively, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data includes one or more of the following: distance information, correlation information, differential information, or soft classification information. A specific formula for determining the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data based on to-be-processed data and the network parameters of the 1th layer is as follows:
Πil=argmin dist(Z, Cil); or
Πil=argmin <Z, Cil>; or
where
Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Cil is network parameters of an ith category of the 1th layer, Zl is a data feature of the to-be-processed data at the 1th layer, Zl-1 is a data feature of the to-be-processed data at the (1-1)th layer, and <> represents an inner product.
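For illustration, the distance-based variant above amounts to assigning each sample to its nearest clustering center Cil. The following sketch performs a hard nearest-center assignment and returns the corresponding diagonal matrices; the inner-product variant only changes the scoring line. The names and the hard-assignment choice are illustrative assumptions.

    import numpy as np

    def assign_by_distance(Z, centers):
        # Pi_i = argmin_i dist(z, C_i): assign every sample (column of Z) to its nearest center
        C = np.stack(centers)                                            # shape (K, d)
        dists = np.linalg.norm(C[:, :, None] - Z[None, :, :], axis=1)    # shape (K, m)
        nearest = dists.argmin(axis=0)
        K = len(centers)
        return [np.diag((nearest == i).astype(float)) for i in range(K)]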
After the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data is determined, the determining an objective function gradient expression based on the to-be-processed data and the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data specifically includes: determining gradient parameters (G and Hi) based on the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determining the objective function gradient expression based on the to-be-processed data and the gradient parameters. A specific formula is as follows:
G=[g1, g2, . . . , gi]; and
where
Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data of an ith category in m pieces of to-be-processed data, m=ΣiK mi, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data, and G and Hi represent the gradient parameters. The objective function gradient expression is shown in Formula (3). Then, the data feature of the to-be-processed data is determined based on the to-be-processed data and the objective function gradient expression.
In a possible implementation, the method further includes: outputting the data feature of the to-be-processed data.
In another possible implementation, the to-be-processed data of the unknown classification or clustering information is a data feature of third data, the data feature of the third data is determined through another feedforward neural network, input information of an 1th layer in the another feedforward neural network includes category distribution information of training data and a first data feature, output information of the 1th layer includes a second data feature, the first data feature is an output of an (1-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and 1 is a positive integer greater than 1. In other words, the data feature of the third data is determined through another feedforward neural network. The data feature of the third data is the to-be-processed data of the unknown classification or clustering information. Then, the to-be-processed data is inputted into a determined feedforward neural network model to obtain the data feature of the to-be-processed data.
To better describe a deduction process of the feedforward neural network model, descriptions are provided by using examples in which Formula (1), Formula (2), and Formula (3) are respectively used as the objective function gradient expression. Details are as follows.
In an example, an example in which the objective function gradient expression is Formula (1) is used. A computation process of an 1th layer in the deduction process in the feedforward neural network model is shown in
In an example, an example in which the objective function gradient expression is Formula (2) is used. A computation process of the 1th layer in the deduction process in the feedforward neural network model is shown in
In an example, an example in which the objective function gradient expression is Formula (3) is used. A computation process of the 1th layer in the deduction process in the feedforward neural network model is shown in
In embodiments of this application, the data processing methods shown in
(1) The multi-view scenario is shown in
When the data feature Z1 extracted by the first transmit end and the data feature Z2 extracted by the second transmit end are sent to the receive end through channel transmission, the following condition is met:
where
Z1 represents a data feature before channel transmission, represents a feature matrix after channel transmission, n represents Gaussian noise n~N(0, σ2) whose standard deviation is σ, and Var(·) represents a variance.
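For illustration, transmission of an extracted feature over an AWGN channel can be simulated by adding Gaussian noise to the feature matrix. In the sketch below, the noise standard deviation is derived from a target signal-to-noise ratio relative to the average feature power; this scaling is an illustrative assumption, since the exact condition is not reproduced here.

    import numpy as np

    def awgn_channel(Z, snr_db, rng=None):
        # Z_hat = Z + n, n ~ N(0, sigma^2); sigma chosen from the target SNR (in dB)
        rng = rng or np.random.default_rng()
        signal_power = np.mean(Z ** 2)
        sigma = np.sqrt(signal_power / (10 ** (snr_db / 10)))
        return Z + rng.normal(scale=sigma, size=Z.shape)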
(2) As shown in
In embodiments of this application, after training the feedforward neural network model according to the data processing method shown in
In an implementation, the training device separately trains a designed objective function in a gradient backhaul manner and a feedforward propagation manner. Details are as follows.
The training device uses Formula (1) as an objective function, where an MNIST handwritten font set is used as an example, and a used feature dimension is 128, obtains a result before a readout layer through training by using a Resnet18 network, and reduces the result before the readout layer to 2D visualized data by using a t-distributed stochastic neighbor embedding (t-distributed stochastic neighbor embedding, t-SNE) algorithm, which is specifically shown in
A multi-layer network structure is designed according to the foregoing feedforward neural network solution. A result of a k-nearest neighbor (k-nearest neighbor, KNN) classification algorithm of a final output feature is tested through an AWGN channel, which is specifically shown in Table 1. An MNIST handwritten font set is used as an example. A used feature dimension is the same as an input dimension and is 768, a learning rate of the feedforward neural network model λ=0.001, a signal-to-noise ratio SNR=25 dB, η=500, η is a hyperparameter for controlling estimation confidence when category labels are predicted, and a quantity of training samples m=1000. It can be learned from Table 1 that, as a quantity of layers in a feedforward neural network increases, accuracy of an extracted data feature is higher. For example, when a quantity of intermediate layers in the feedforward neural network is 2, accuracy of a training set is 0.5247. When the quantity of intermediate layers in the feedforward neural network is 6, accuracy of the training set is 0.7135. The accuracy of the training set when the quantity of intermediate layers is 6 is higher than the accuracy of the training set when the quantity of intermediate layers is 2 by 0.1888.
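For illustration, the KNN evaluation of the final output feature described above can be sketched with scikit-learn as follows; the value k=5 and the train/test split are illustrative assumptions.

    from sklearn.neighbors import KNeighborsClassifier

    def knn_accuracy(Z_train, y_train, Z_test, y_test, k=5):
        # Features are d x m (one sample per column); scikit-learn expects samples in rows
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(Z_train.T, y_train)
        return clf.score(Z_test.T, y_test)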
In the foregoing method, compared with a BP algorithm in which a transmit-end network needs to be updated through gradient backhaul, the method in embodiments of this application can reduce communication overheads caused by training and interaction, and improve training efficiency. A receive end needs to train only a readout layer network. In addition, a structure of the feedforward neural network is more flexible, and accuracy can be improved by increasing a quantity of network layers. In other words, when a value of 1 is larger, accuracy of the classification or clustering result of the to-be-processed data is higher, thereby avoiding a problem that retraining is needed due to different adaptations of different transmit/receive-end networks. In addition, the feedforward neural network model is interpretable, and a black box problem of a neural network can be interpreted. In addition, the output data feature of the to-be-processed data may be used as data preprocessing, and can be used for a subsequent readout layer operation.
The method in embodiments of this application is described in detail above. An apparatus in embodiments of this application is provided below.
The first determining unit 1701 is configured to determine a feedforward neural network model, where input information of an 1th layer in the feedforward neural network model includes category distribution information of training data and a first data feature, output information of the 1th layer includes a second data feature, the first data feature is an output of an (1-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and 1 is a positive integer greater than 1.
The obtaining unit 1702 is configured to obtain to-be-processed data of unknown classification or clustering information.
The second determining unit 1703 is configured to input the to-be-processed data into the feedforward neural network model to determine a data feature of the to-be-processed data, where the data feature of the to-be-processed data is the classification or clustering information representing the to-be-processed data, and the data feature of the to-be-processed data is used to determine a classification or clustering result of the to-be-processed data.
In a possible implementation, a dimension of the data feature of the to-be-processed data is related to a data type of the to-be-processed data.
In another possible implementation, when 1=2, and the first data feature is an output of a 1st layer, input information of the 1st layer includes the category distribution information of the training data and the training data, where the training data includes category labels, and the category distribution information of the training data is determined based on the category labels in the training data.
In another possible implementation, the first determining unit 1701 is specifically configured to obtain the first data feature Zl-1; and determine network parameters of the 1th layer based on the first data feature Zl-1 and the category distribution information Πi of the training data, where the second data feature is determined based on the first data feature Zl-1 and the network parameters of the 1th layer.
In another possible implementation, the first determining unit 1701 is specifically configured to determine an objective function gradient expression based on the network parameters of the 1th layer and the first data feature Zl-1; and determine the second data feature Zl based on the first data feature Zl-1, the category distribution information Πi of the training data, and the objective function gradient expression.
In another possible implementation, the first determining unit 1701 is specifically configured to determine, based on the first data feature Zl-1 and the category distribution information Πi of the training data, a regularized autocorrelation matrix of data whose category label corresponds to each category in the training data; and determine the network parameters of the 1th layer based on the regularized autocorrelation matrix of the data whose category label corresponds to each category in the training data.
In another possible implementation,
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data,
αi is a weight parameter used to balance quantities of samples of the categories in the training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, and Uil is network parameters of the ith category of the 1th layer.
In another possible implementation,
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, I is an identity matrix, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, S is autocorrelation matrixes of data whose category labels correspond to all the categories in the training data, Ŝ is regularized autocorrelation matrixes of the data whose category labels correspond to all the categories in the training data, and Ail is network parameters of the ith category of the 1th layer.
In another possible implementation, the first determining unit 1701 is specifically configured to determine gradient parameters based on the category distribution information Πi of the training data; and determine the network parameters of the 1th layer based on the first data feature Zl-1 and the gradient parameters.
In another possible implementation,
where
Zl-1 satisfies an energy constraint: Tr(Zl-1(Zl-1)T)=m (1+σ2d), σ is a Gaussian distribution variance, m is a quantity of samples of the training data, d is a dimension of the training data, Zl-1 is the first data feature, e∈Rm×1 is a column vector whose elements are all 1, Πi is the category distribution information of the training data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data whose category label corresponds to an ith category in the m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Cil is network parameters of the ith category of the 1th layer, and G and Hi are the gradient parameters.
In another possible implementation, the second determining unit 1703 is specifically configured to determine, based on the to-be-processed data and the network parameters of the 1th layer, category distribution information Πil that corresponds to predicted category labels and that is of the to-be-processed data; determine an objective function gradient expression based on the to-be-processed data and the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determine the data feature of the to-be-processed data based on the to-be-processed data and the objective function gradient expression.
In another possible implementation, the second determining unit 1703 is specifically configured to determine projections of the predicted category labels in the to-be-processed data on a first category based on the to-be-processed data and the network parameters of the 1th layer, where the first category is any one of a plurality of categories corresponding to the predicted category labels in the to-be-processed data; and determine, based on the projections of the predicted category labels in the to-be-processed data on the first category, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data.
In another possible implementation,
pil=UilZ; and
where
Z is the to-be-processed data, Uil is network parameters of an ith category of the 1th layer, pil is projections of the predicted category labels in the to-be-processed data on the ith category of the 1th layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
In another possible implementation, the objective function gradient expression includes:
where
mi is a quantity of pieces of data whose predicted category label is the ith category in m pieces of to-be-processed data, m=ΣiK mi, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data,
αi is a weight parameter used to balance quantities of samples of the predicted categories in the to-be-processed data, Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Si is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, and Ŝi is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data.
In another possible implementation,
pil=AilZ; and
where
Z is the to-be-processed data, Ail is network parameters of an ith category of the 1th layer, pil is projections of the predicted category labels in the to-be-processed data on the ith category of the 1th layer, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
In another possible implementation, the objective function gradient expression includes:
where
mi is a quantity of pieces of data whose predicted category label is an ith category in m pieces of to-be-processed data,
αi is a weight parameter used to balance quantities of samples of predicted categories in the to-be-processed data, Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Si is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, Ŝi is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, S is autocorrelation matrixes of data whose predicted category labels correspond to all the categories in the to-be-processed data, and Ŝ is regularized autocorrelation matrixes of the data whose predicted category labels correspond to all the categories in the to-be-processed data.
In another possible implementation, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data includes one or more of the following: distance information, correlation information, differential information, or soft classification information.
In another possible implementation, Πil=argmin dist(Z, Cil); or
Πil=argmin <Z, Cil>; or
where
Z is the to-be-processed data, Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Cil is network parameters of an ith category of the 1th layer, Zl is a data feature of the to-be-processed data at the 1th layer, Zl-1 is a data feature of the to-be-processed data at the (1-1)th layer, and <> represents an inner product.
In another possible implementation, the second determining unit 1703 is specifically configured to determine gradient parameters (G and Hi) based on the category distribution information Πil that corresponds to the predicted category labels and that is of the to-be-processed data; and determine the objective function gradient expression based on the to-be-processed data and the gradient parameters.
In another possible implementation,
G=[g1, g2, . . . , gi]; and
where
Πil is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data of an ith category in m pieces of to-be-processed data, m=ΣiK mi, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data, and G and Hi represent the gradient parameters.
In another possible implementation, the objective function gradient expression includes:
where
Z is the to-be-processed data, σ is a Gaussian distribution variance, ϵ is a regularization parameter, I is an identity matrix, G and Hi represent the gradient parameters, and β represents a regularization parameter.
In another possible implementation,
where
Zl is the data feature of the to-be-processed data, ∂L/∂Z is the objective function gradient expression, Zl-1 is the to-be-processed data, and Zl-1 is constrained in (d-1)-dimensional unit sphere space.
In another possible implementation, the data processing apparatus further includes an output unit. The output unit is configured to output the data feature of the to-be-processed data.
In another possible implementation, the to-be-processed data of the unknown classification or clustering information is a data feature of third data, the data feature of the third data is determined through another feedforward neural network, input information of an 1th layer in the another feedforward neural network includes category distribution information of training data and a first data feature, output information of the 1th layer includes a second data feature, the first data feature is an output of an (1-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and 1 is a positive integer greater than 1.
It should be noted that, for implementation and beneficial effects of the units, refer to corresponding descriptions of the method embodiment shown in
The memory 1802 includes but is not limited to a random access memory (random access memory, RAM), a read-only memory (read-only memory, ROM), an erasable programmable read-only memory (erasable programmable read-only memory, EPROM), or a compact disc read-only memory (compact disc read-only memory, CD-ROM). The memory 1802 is configured to store a related computer program and data. The communication interface 1803 is configured to receive and send data.
The processor 1801 may be one or more central processing units (central processing units, CPUs). When the processor 1801 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The processor 1801 in the data processing apparatus 1800 is configured to read computer program code stored in the memory 1802, to perform the following operations: determining a feedforward neural network model, where input information of an 1th layer in the feedforward neural network model includes category distribution information of training data and a first data feature, output information of the 1th layer includes a second data feature, the first data feature is an output of an (1-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and 1 is a positive integer greater than 1; obtaining to-be-processed data of unknown classification or clustering information; and inputting the to-be-processed data into the feedforward neural network model to determine a data feature of the to-be-processed data, where the data feature of the to-be-processed data is the classification or clustering information representing the to-be-processed data, and the data feature of the to-be-processed data is used to determine a classification or clustering result of the to-be-processed data.
In a possible implementation, a dimension of the data feature of the to-be-processed data is related to a data type of the to-be-processed data.
In another possible implementation, when 1=2, and the first data feature is an output of a 1st layer, input information of the 1st layer includes the category distribution information of the training data and the training data, where the training data includes category labels, and the category distribution information of the training data is determined based on the category labels in the training data.
In another possible implementation, the processor 1801 is configured to obtain the first data feature Zl-1; and determine network parameters of the 1th layer based on the first data feature Zl-1 and the category distribution information Πi of the training data, where the second data feature is determined based on the first data feature Zl-1 and the network parameters of the 1th layer.
In another possible implementation, the processor 1801 is configured to determine an objective function gradient expression based on the network parameters of the 1th layer and the first data feature Zl-1; and determine the second data feature Zl based on the first data feature Zl-1, the category distribution information Πi of the training data, and the objective function gradient expression.
In another possible implementation, the processor 1801 is configured to determine, based on the first data feature Zl-1 and the category distribution information Πi of the training data, a regularized autocorrelation matrix of data whose category label corresponds to each category in the training data; and determine the network parameters of the 1th layer based on the regularized autocorrelation matrix of the data whose category label corresponds to each category in the training data.
In another possible implementation,
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data,
αi is a weight parameter used to balance quantities of samples of the categories in the training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, and Uil is network parameters of the ith category of the 1th layer.
where
mi is a quantity of pieces of data whose category label corresponds to an ith category in m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Zl-1 is the first data feature, Πi is the category distribution information of the training data, I is an identity matrix, Si is an autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, ϵ is a regularization parameter, Ŝi is a regularized autocorrelation matrix of the data whose category label corresponds to the ith category in the training data, S is autocorrelation matrixes of data whose category labels correspond to all the categories in the training data, Ŝ is regularized autocorrelation matrixes of the data whose category labels correspond to all the categories in the training data, and Ail is network parameters of the ith category of the 1th layer.
In another possible implementation, the processor 1801 is configured to determine gradient parameters based on the category distribution information Πi of the training data; and determine the network parameters of the 1th layer based on the first data feature Zl-1 and the gradient parameters.
In another possible implementation,
where
Zl-1 satisfies an energy constraint: Tr(Zl-1(Zl-1)T)=m(1 +σ2d), σ is a Gaussian distribution variance, m is a quantity of samples of the training data, d is a dimension of the training data, Zl-1 is the first data feature, e∈Rm×1 is a column vector whose elements are all 1, Πi is the category distribution information of the training data, Tr( ) represents a trace operation, I is an identity matrix, mi is a quantity of pieces of data whose category label corresponds to an ith category in the m pieces of training data, m=ΣiK mi, K is a quantity of all categories of category labels in the m pieces of training data, Cil is network parameters of the ith category of the 1th layer, and G and Hi are the gradient parameters.
In another possible implementation, the processor 1801 is configured to determine, based on the to-be-processed data and the network parameters of the lth layer, category distribution information Π_i^l that corresponds to predicted category labels and that is of the to-be-processed data; determine an objective function gradient expression based on the to-be-processed data and the category distribution information Π_i^l that corresponds to the predicted category labels and that is of the to-be-processed data; and determine the data feature of the to-be-processed data based on the to-be-processed data and the objective function gradient expression.
In another possible implementation, the processor 1801 is configured to determine projections of the predicted category labels in the to-be-processed data on a first category based on the to-be-processed data and the network parameters of the lth layer, where the first category is any one of a plurality of categories corresponding to the predicted category labels in the to-be-processed data; and determine, based on the projections of the predicted category labels in the to-be-processed data on the first category, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data.
In another possible implementation,
p_i^l = U_i^l Z; and
where
Z is the to-be-processed data, U_i^l represents network parameters of an ith category of the lth layer, p_i^l is projections of the predicted category labels in the to-be-processed data on the ith category of the lth layer, Π_i^l is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
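The mapping from the projections p_i^l = U_i^l Z to the estimated category distribution Π_i^l is defined by the formula omitted above. A minimal sketch is given below, assuming a softmax over negative projection norms with η acting as the confidence hyperparameter; the softmax form and the use of squared norms are assumptions, chosen only because they are consistent with η "controlling estimation confidence".

```python
import numpy as np

def estimate_distribution(z, params, eta):
    """Estimate per-sample category membership from projections p_i^l = U_i^l @ Z.
    Columns of Z (d x m) are samples; params[i] plays the role of U_i^l.
    The softmax over -eta * ||p_i^l||^2 is an assumed soft-assignment rule."""
    proj_norms = np.stack([np.sum((u_i @ z) ** 2, axis=0) for u_i in params])  # K x m
    logits = -eta * proj_norms
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)    # memberships sum to 1 over categories
    return [np.diag(weights[i]) for i in range(weights.shape[0])]  # Pi_i^l as diagonal matrices
```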
In another possible implementation, the objective function gradient expression includes:
where
m_i is a quantity of pieces of data whose predicted category label is the ith category in m pieces of to-be-processed data, m = Σ_{i=1}^{K} m_i, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data, α_i is a weight parameter used to balance quantities of samples of the predicted categories in the to-be-processed data, Z is the to-be-processed data, Π_i^l is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, S_i is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, and Ŝ_i is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data.
In another possible implementation,
p_i^l = A_i^l Z; and
where
Z is the to-be-processed data, A_i^l represents network parameters of an ith category of the lth layer, p_i^l is projections of the predicted category labels in the to-be-processed data on the ith category of the lth layer, Π_i^l is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, and η is a hyperparameter for controlling estimation confidence.
In another possible implementation, the objective function gradient expression includes:
where
m_i is a quantity of pieces of data whose predicted category label is an ith category in m pieces of to-be-processed data, α_i is a weight parameter used to balance quantities of samples of predicted categories in the to-be-processed data, Z is the to-be-processed data, Π_i^l is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, S_i is an autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, Ŝ_i is a regularized autocorrelation matrix of the data whose predicted category label corresponds to the ith category in the to-be-processed data, S is autocorrelation matrices of data whose predicted category labels correspond to all the categories in the to-be-processed data, and Ŝ is regularized autocorrelation matrices of the data whose predicted category labels correspond to all the categories in the to-be-processed data.
In another possible implementation, the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data includes one or more of the following: distance information, correlation information, differential information, or soft classification information.
In another possible implementation, Π_i^l = argmin dist(Z, C_i^l); or
Π_i^l = argmin <Z, C_i^l>; or
where
Z is the to-be-processed data, Π_i^l is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, C_i^l represents network parameters of an ith category of the lth layer, Z^l is a data feature of the to-be-processed data at the lth layer, Z^(l-1) is a data feature of the to-be-processed data at the (l-1)th layer, and <·, ·> represents an inner product.
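The two assignment rules listed above (by a distance to the per-category parameters, or by an inner product) can be sketched as follows. How C_i^l acts on Z and the concrete distance are assumptions made only so the example is runnable.

```python
import numpy as np

def assign_by_distance(z, c_params):
    """For each column of Z, pick the category whose parameters C_i^l give the
    smallest residual distance (the residual form z - C_i @ z is illustrative)."""
    dists = np.stack([np.linalg.norm(z - c_i @ z, axis=0) for c_i in c_params])  # K x m
    return np.argmin(dists, axis=0)            # predicted category index per sample

def assign_by_inner_product(z, c_params):
    """Pick the category by the inner-product criterion <Z, C_i^l>; the text
    states argmin, so the minimizing index is returned here."""
    scores = np.stack([np.sum(z * (c_i @ z), axis=0) for c_i in c_params])       # K x m
    return np.argmin(scores, axis=0)
```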
In another possible implementation, the processor 1801 is configured to determine gradient parameters (G and H_i) based on the category distribution information Π_i^l that corresponds to the predicted category labels and that is of the to-be-processed data; and determine the objective function gradient expression based on the to-be-processed data and the gradient parameters.
In another possible implementation,
G = [g_1, g_2, ..., g_i]; and
where
Π_i^l is the category distribution information that corresponds to the predicted category labels and that is of the to-be-processed data, Tr( ) represents a trace operation, I is an identity matrix, m_i is a quantity of pieces of data of an ith category in m pieces of to-be-processed data, m = Σ_{i=1}^{K} m_i, K is a quantity of all categories of predicted category labels in the m pieces of to-be-processed data, and G and H_i represent the gradient parameters.
In another possible implementation, the objective function gradient expression includes:
where
Z is the to-be-processed data, σ is a Gaussian distribution variance, ϵ is a regularization parameter, I is an identity matrix, G and H_i represent the gradient parameters, and β represents a regularization parameter.
In another possible implementation,
where
Z^l is the data feature of the to-be-processed data, ∂L/∂Z is the objective function gradient expression, Z^(l-1) is the to-be-processed data, and Z^(l-1) is constrained in a (d-1)-dimensional unit sphere space.
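A minimal sketch of the feature update implied here: take a step along the objective function gradient ∂L/∂Z and project each column of the result back onto the (d-1)-dimensional unit sphere. The step size and the use of a simple additive step are assumptions; the omitted formula defines the exact update.

```python
import numpy as np

def layer_update(z_prev, grad, step=0.5):
    """Update the feature with the objective-function gradient and re-project
    each column (sample) onto the unit sphere in R^d, i.e. onto the
    (d-1)-dimensional unit sphere constraint stated above."""
    z_next = z_prev + step * grad              # gradient step (direction and step size assumed)
    norms = np.linalg.norm(z_next, axis=0, keepdims=True)
    return z_next / np.maximum(norms, 1e-12)   # column-wise projection onto the unit sphere
```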
In another possible implementation, the processor 1801 is configured to output the data feature of the to-be-processed data.
In another possible implementation, the to-be-processed data of the unknown classification or clustering information is a data feature of third data, the data feature of the third data is determined through another feedforward neural network, input information of an lth layer in the another feedforward neural network includes category distribution information of training data and a first data feature, output information of the lth layer includes a second data feature, the first data feature is an output of an (l-1)th layer, both the first data feature and the second data feature are classification or clustering information representing the training data, and l is a positive integer greater than 1.
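The cascade described here, in which the data feature produced by one such feedforward network serves as the to-be-processed data of another, can be sketched as follows. The FeedforwardModel class and its process method are placeholders standing in for the layer-by-layer computation described above (random linear maps are used only so the example runs); they are not an interface defined in this application.

```python
import numpy as np

class FeedforwardModel:
    """Placeholder for a feedforward network of the kind described above: each
    layer maps a data feature Z^(l-1) to Z^l (simulated here with random maps)."""
    def __init__(self, d, num_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(num_layers)]

    def process(self, z):
        for w in self.layers:                  # stand-in for the per-layer feature update
            z = w @ z
            z /= np.maximum(np.linalg.norm(z, axis=0, keepdims=True), 1e-12)
        return z                               # data feature of the input data

# The feature extracted by the first network is the to-be-processed data of the second.
first_net = FeedforwardModel(d=8, num_layers=3, seed=1)
second_net = FeedforwardModel(d=8, num_layers=3, seed=2)
third_data = np.random.randn(8, 20)
feature_of_third_data = first_net.process(third_data)   # output of the other network
final_feature = second_net.process(feature_of_third_data)
```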
It should be noted that, for implementation and beneficial effects of the operations, refer to corresponding descriptions of the method embodiment shown in
It may be understood that the processor in embodiments of this application may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general purpose processor may be a microprocessor or any conventional processor.
The method steps in embodiments of this application may be implemented by hardware, or may be implemented by the processor executing software instructions. The software instructions may include a corresponding software module. The software module may be stored in a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an erasable programmable read-only memory, an electrically erasable programmable read-only memory, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well-known in the art. For example, a storage medium is coupled to a processor, so that the processor can read information from the storage medium and write information into the storage medium. Certainly, the storage medium may alternatively be a component of the processor. The processor and the storage medium may be disposed in an ASIC. In addition, the ASIC may be located in a base station or a terminal. Certainly, the processor and the storage medium may alternatively exist in a base station or a terminal as discrete components.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When the software is used to implement embodiments, all or a part of embodiments may be implemented in a form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, all or some of the procedures or functions in embodiments of this application are executed. The computer may be a general purpose computer, a dedicated computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer programs or instructions may be stored in a computer-readable storage medium, or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer programs or instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium that can be accessed by the computer, or a data storage device, for example, a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; or may be an optical medium, for example, a digital video disc; or may be a semiconductor medium, for example, a solid-state drive. The computer-readable storage medium may be a volatile or non-volatile storage medium, or may include two types of storage media: a volatile storage medium and a non-volatile storage medium.
In embodiments of this application, unless otherwise stated or there is a logic conflict, terms and/or descriptions between different embodiments are consistent and may be mutually referenced, and technical features in different embodiments may be combined into a new embodiment based on an internal logical relationship thereof.
In the descriptions of this application, terms such as “first”, “second”, “S601”, or “S602” are merely used for distinguishing between descriptions and for ease of context. Different sequence numbers have no specific technical meaning, and cannot be understood as an indication or implication of relative importance, or an indication or implication of an execution sequence of operations. The execution sequence of each process should be determined based on functions and internal logic of the processes.
The term “and/or” in this application describes only an association relationship for associated objects, and indicates that three relationships may exist. For example, “A and/or B” may indicate the following three cases: Only A exists; both A and B exist; or only B exists. A and B may be singular or plural. In addition, the character “/” in this specification indicates an “or” relationship between the associated objects.
In this application, “transmission” may include the following three cases: data sending, data receiving, or data sending and data receiving. In this application, “data” may include service data and/or signaling data.
In this application, the terms “include” or “have” and any variation thereof are intended to cover non-exclusive inclusion. For example, a process/method that includes a series of steps, or a system/product/device that includes a series of units, is not necessarily limited to those expressly listed steps or units, but may include other steps or units that are not expressly listed or that are inherent to these processes/methods/products/devices.
In addition, in descriptions of this application, unless otherwise specified, a quantity of nouns indicates “a singular noun or a plural noun”, that is, “one or more”. “At least one” indicates one or more. “At least one of the following: A, B, and C is included” may indicate that A is included, B is included, C is included, A and B are included, A and C are included, B and C are included, or A, B and C are included. A, B, and C may be one or more. A, B, and C may be singular or plural.
Number | Date | Country | Kind
---|---|---|---
202210290759.2 | Mar 2022 | CN | national
This application is a continuation of International Application No. PCT/CN2023/082740, filed on Mar. 21, 2023, which claims priority to Chinese Patent Application No. 202210290759.2, filed on Mar. 23, 2022. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/CN2023/082740 | Mar 2023 | WO
Child | 18892583 | | US