METHOD AND APPARATUS FOR TRAINING INTELLIGENT MODEL

TECHNICAL FIELD

This application relates to the communication field, and more specifically, to a method and an apparatus for training an intelligent model.

BACKGROUND

Artificial intelligence (AI) is a quite important application in future wireless communication networks (such as the Internet of Things). Federated learning (FL) is a distributed intelligent model training method. A server provides model parameters for a plurality of devices. The plurality of devices separately perform intelligent model training based on respective datasets and then feed back gradient information of a loss function to the server. The server updates the model parameters based on the gradient information fed back by the plurality of devices.

A model of the plurality of devices participating in model training is the same as a model of the server, and types of training data used by the devices participating in model training are the same. For example, in training of an image recognition model, a plurality of image collection devices may train the model by using image data separately collected by the plurality of image collection devices. In this manner, diversity of training data can be improved, but diversity of features of an inference target is not considered. Currently, there is no effective solution for implementing model training with diversity of features in federated learning to improve model performance.

SUMMARY

This application provides a method and an apparatus for training an intelligent model, to implement distributed training based on different features through federated learning, so that performance of a model obtained through training can be improved.

According to a first aspect, a method for training an intelligent model is provided, where a central node and a plurality of participant node groups jointly perform training of the intelligent model, the intelligent model consists of a plurality of feature models corresponding to a plurality of features of an inference target, participant nodes in one of the participant node groups train one of the feature models, and the training method is performed by a first participant node that trains a first feature model and that is in the plurality of participant node groups, and the method includes: receiving first information from the central node, where the first information indicates an inter-feature constraint variable, and the inter-feature constraint variable represents a constraint relationship between different features; obtaining first gradient information based on the inter-feature constraint variable, a model parameter of the first feature model, and first sample data by using a gradient inference model, where the first gradient information is gradient information corresponding to the inter-feature constraint variable; and sending the first gradient information to the central node.

Based on the foregoing solution, as model parameters are updated in a model training process, different types of participant nodes calculate the gradient information corresponding to the inter-feature constraint variable, and feed back the gradient information to the central node. The central node updates, based on the gradient information that is of the inter-feature constraint variable and that is obtained through inference by participant nodes corresponding to different feature models, the inter-feature constraint variable that represents an association relationship between features, to implement inter-feature decoupling of the models. In this way, the different types of participant nodes can train different feature models based on the inter-feature constraint variable and local feature data, so that the central node can update the inter-feature constraint variable based on gradients that are of the inter-feature constraint variable and that are fed back by the different types of participant nodes. This implements diversity of training data and diversity of features of federated learning without a need to transmit raw data. To be specific, raw data leakage is avoided, and the performance of the trained model can be improved.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: receiving a first identifier set from the central node, where the first identifier set includes an identifier of sample data of the inter-feature constraint variable selected by the central node; and the obtaining first gradient information based on the inter-feature constraint variable, a model parameter of the first feature model, and first sample data by using a gradient inference model, where the first gradient information is gradient information corresponding to the inter-feature constraint variable includes: determining that a sample data set of the first participant node includes the first sample data corresponding to a first identifier, where the first identifier belongs to the first identifier set; and obtaining the first gradient information based on the inter-feature constraint variable, the model parameter of the first feature model, and the first sample data by using the gradient inference model, where the first gradient information is the gradient information corresponding to the inter-feature constraint variable.

Based on the foregoing solution, the central node selects, by using the first identifier set, some sample data for inferring the gradient information of the inter-feature constraint variable. A participant node that stores the selected sample data infers, based on the model parameter of the current feature model and the sample data, the gradient information of the inter-feature constraint variable, and feeds back the gradient information to the central node. Compared with a manner in which each participant node participates in feeding back the gradient information of the inter-feature constraint variable, resource overheads and implementation complexity can be reduced.

With reference to the first aspect, in some implementations of the first aspect, the sending the first gradient information to the central node includes: sending quantized first target gradient information to the central node, where the first target gradient information includes the first gradient information, or the first target gradient information includes the first gradient information and first residual gradient information, and the first residual gradient information represents a residual amount that is of gradient information corresponding to the inter-feature constraint variable and that is not sent to the central node before the first gradient information is obtained.

Based on the foregoing solution, after one time of model training, the participant node may send, to the central node, the gradient information obtained through training and the residual amount that is of the gradient information and that is not sent to the central node before the current time of model training. In this way, the central node can obtain the residual gradient information, to improve model training efficiency.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: obtaining second residual gradient information based on the first target gradient information and the quantized first target gradient information, where the second residual gradient information is a residual amount that is in the first target gradient information and that is not sent to the central node.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: determining a first threshold based on first quantization noise information and channel resource information, where the first quantization noise information represents a loss introduced by quantization encoding and decoding on the first target gradient information. The sending quantized first target gradient information to the central node includes: determining that a metric value of the first target gradient information is greater than the first threshold; and sending the quantized first target gradient information to the central node.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: if the metric value of the first target gradient information is less than or equal to the first threshold, determining not to send the quantized first target gradient information to the central node.

Based on the foregoing solution, the participant node determines, based on the quantization noise information and the channel resource information, the threshold that is of the metric value and that is used for determining whether to send the quantized target gradient information. In this manner, the loss introduced by the quantization encoding and decoding on the target information is considered, and whether to send the target information to the central node is determined, to implement adaptive scheduling of a channel environment by the participant node, thereby improving signal transmission reliability and channel resource utilization.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: if the metric value of the first target gradient information is less than the first threshold, determining third residual gradient information, where the third residual gradient information is the first target gradient information.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: obtaining the first quantization noise information based on the channel resource information, communication cost information, and the first target gradient information, where the communication cost information indicates a communication cost weight of a communication resource, and the communication resource includes transmission power and/or transmission bandwidth.

Based on the foregoing solution, the participant node may obtain the first quantization noise information based on the channel resource information, the communication cost information, and the first target gradient information, so that adaptive scheduling can be implemented based on the first quantization noise information, thereby improving signal transmission reliability and channel resource utilization.

With reference to the first aspect, in some implementations of the first aspect, the determining a first threshold based on first quantization noise information and channel resource information includes: determining the transmission bandwidth and/or the transmission power based on the first quantization noise information, the communication cost information, the channel resource information, and the first target gradient information, where the communication cost information indicates the communication cost weight of the communication resource, and the communication resource includes the transmission power and/or the transmission bandwidth; and determining the first threshold based on the first quantization noise information and the communication resource.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: receiving second information from the central node, where the second information indicates the communication cost information.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: training the first feature model based on the inter-feature constraint variable and model training data, to obtain second gradient information; and sending the second gradient information to the central node.

Based on the foregoing solution, inter-feature decoupling of models is implemented by using the inter-feature constraint variable, so that different types of participant nodes can train different feature models based on the inter-feature constraint variable and the local feature data. This not only implements diversity of training data in federated learning, but also implements model training for different features. In this way, the performance of the trained model is improved.

With reference to the first aspect, in some implementations of the first aspect, the sending the second gradient information to the central node includes: sending quantized second target gradient information to the central node, where the second target gradient information includes the second gradient information, or the second target gradient information includes the second gradient information and fourth residual gradient information, and the fourth residual gradient information represents a residual amount that is of gradient information and that is not sent to the central node before the second gradient information is obtained.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: obtaining fifth residual gradient information based on the second target gradient information and the quantized second target gradient information, where the fifth residual gradient information represents a residual amount that is in the second target gradient information and that is not sent to the central node.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: determining a second threshold based on second quantization noise information and the channel resource information, where the second quantization noise information represents a loss introduced by quantization encoding and decoding on the second target gradient information; and the sending quantized second target gradient information to the central node includes: determining that a metric value of the second target gradient information is greater than the second threshold; and sending the quantized second target gradient information to the central node.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: if the metric value of the second target gradient information is less than or equal to the second threshold, determining not to send the quantized second target gradient information to the central node.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: if the metric value of the second target gradient information is less than the second threshold, determining sixth residual gradient information, where the sixth residual gradient information is the second target gradient information.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: obtaining the second quantization noise information based on the channel resource information, the communication cost information, and the second target gradient information, where the communication cost information indicates the communication cost weight of the communication resource, and the communication resource includes the transmission power and/or the transmission bandwidth.

With reference to the first aspect, in some implementations of the first aspect, the determining a second threshold based on second quantization noise information and the channel resource information includes: determining the transmission bandwidth and/or the transmission power based on the second quantization noise information, the communication cost information, the channel resource information, and the second target gradient information, where the communication cost information indicates the communication cost weight of the communication resource, and the communication resource includes the transmission power and/or the transmission bandwidth; and determining the second threshold based on the second quantization noise information and the communication resource.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: receiving third information from the central node, where the third information indicates an updated parameter of the first feature model; and updating the parameter of the first feature model based on the third information.

According to a second aspect, a method for training an intelligent model is provided, where a central node and a plurality of participant node groups jointly perform training of the intelligent model, the intelligent model consists of a plurality of feature models corresponding to a plurality of features of an inference target, participant nodes in one of the participant node groups train one of the feature models, and the training method is performed by the central node, and includes: determining an inter-feature constraint variable, where the inter-feature constraint variable represents a constraint relationship between different features; and sending first information to a participant node in the plurality of participant node groups, where the first information includes the inter-feature constraint variable.

With reference to the second aspect, in some implementations of the second aspect, the method further includes: receiving at least one piece of second target gradient information from a participant node in a first participant node group, where the plurality of participant node groups include the first participant node group; determining an updated model parameter of a first feature model based on the at least one piece of second target gradient information, where the first feature model is a feature model trained by the participant node in the first participant node group; and sending the updated model parameter to the first participant node group.

Optionally, that the central node receives the at least one piece of second target gradient information from the participant node in the first participant node group may be specifically: The central node receives at least one piece of quantized second target gradient information from the participant node in the first participant node group, and performs quantization decoding on the quantized second target gradient information to obtain the second target gradient information. It should be understood that, based on the descriptions of the specific implementations in this specification, compared with the second target gradient information before the participant node performs quantization encoding, the second target gradient information obtained by the central node through quantization decoding may have a loss introduced by quantization encoding and decoding.

With reference to the second aspect, in some implementations of the second aspect, the method further includes: sending a first identifier set to the participant node in the plurality of participant node groups, where the first identifier set includes an identifier of sample data of the inter-feature constraint variable selected by the central node.

With reference to the second aspect, in some implementations of the second aspect, the method further includes: receiving a plurality of pieces of first target gradient information from the participant node in the plurality of participant node groups, where the first target gradient information is gradient information that corresponds to the inter-feature constraint variable and that is inferred by the participant node; and the determining an inter-feature constraint variable includes: determining the inter-feature constraint variable based on the plurality of pieces of first target gradient information.

With reference to the second aspect, in some implementations of the second aspect, the method further includes: sending second information to the participant node in the plurality of participant node groups, where the second information indicates communication cost information, the communication cost information indicates a communication cost weight of a communication resource, and the communication resource includes transmission power and/or transmission bandwidth.

According to a third aspect, a communication method is provided, including: determining a threshold based on quantization noise information and channel resource information, where the quantization noise information represents a loss introduced by quantization encoding and decoding on target information; and if a metric value of the target information is greater than the threshold, sending quantized target information; or if the metric value of the target information is less than or equal to the threshold, skipping sending the quantized target information.

Based on the foregoing solution, the participant node determines, based on the to-be-transmitted target information and the communication cost information that is broadcast by the central node by considering the loss introduced by the quantization encoding and decoding on the target information, whether to send the target information to the central node, to implement adaptive scheduling of a channel environment by the participant node, thereby improving reliability of target signal transmission and channel resource utilization.

With reference to the third aspect, in some implementations of the third aspect, the target information includes gradient information obtained through an N^thtime of model training and first target residual information, and the first target residual information is a residual amount that is of gradient information and that is not sent before the gradient information is obtained.

With reference to the third aspect, in some implementations of the third aspect, the method further includes: if the metric value of the target information is greater than the threshold, obtaining second target residual information based on the target information and the quantized target information, where the second target residual information is a residual amount that is not sent and that is in the target information.

With reference to the third aspect, in some implementations of the third aspect, the method further includes: if the metric value of the target information is less than or equal to the threshold, determining third target residual information, where the third target residual information is the target information.

With reference to the third aspect, in some implementations of the third aspect, the method further includes: obtaining the quantization noise information based on the channel resource information, communication cost information, and the target information, where the communication cost information indicates a communication cost weight of a communication resource, and the communication resource includes transmission power and/or transmission bandwidth.

With reference to the third aspect, in some implementations of the third aspect, the determining a threshold based on quantization noise information and channel resource information includes: determining the transmission bandwidth and/or the transmission power based on the quantization noise information, the communication cost information, the channel resource information, and the target information, where the communication cost information indicates the communication cost weight of the communication resource, and the communication resource includes the transmission power and/or the transmission bandwidth; and determining the threshold based on the quantization noise information and the communication resource.

With reference to the third aspect, in some implementations of the third aspect, the method further includes: receiving second information, where the second information indicates the communication cost information.

According to a fourth aspect, an apparatus for training an intelligent model is provided, including: a transceiver unit, configured to receive first information from a central node, where the first information indicates an inter-feature constraint variable, and the inter-feature constraint variable represents a constraint relationship between different features; and a processing unit, configured to obtain first gradient information based on the inter-feature constraint variable, a model parameter of a first feature model, and first sample data by using a gradient inference model, where the first gradient information is gradient information corresponding to the inter-feature constraint variable; and the transceiver unit is further configured to send the first gradient information to the central node.

According to a fifth aspect, an apparatus for training an intelligent model is provided, including: a processing unit, configured to determine an inter-feature constraint variable, where the inter-feature constraint variable represents a constraint relationship between different features; and a transceiver unit, configured to send first information to a participant node in the plurality of participant node groups, where the first information includes the inter-feature constraint variable.

According to a sixth aspect, an intelligent communication apparatus is provided, including: a processing unit, configured to determine a threshold based on quantization noise information and channel resource information, where the quantization noise information represents a loss introduced by quantization encoding and decoding on target information; and a transceiver unit, configured to: if a metric value of the target information is greater than the threshold, send quantized target information; or if the metric value of the target information is less than or equal to the threshold, skip sending the quantized target information.

According to a seventh aspect, a communication apparatus is provided, including a processor. The processor may implement the method in any one of the first aspect and the possible implementations of the first aspect, or implement the method in any one of the second aspect and the possible implementations of the second aspect, or implement the method in any one of the third aspect and the possible implementations of the third aspect.

Optionally, the communication apparatus further includes a memory. The processor is coupled to the memory, and may be configured to execute instructions in the memory, to implement the method in any one of the first aspect and the possible implementations of the first aspect, or implement the method in any one of the second aspect and the possible implementations of the second aspect, or implement the method in any one of the third aspect and the possible implementations of the third aspect.

Optionally, the communication apparatus further includes a communication interface, where the processor is coupled to the communication interface. In this embodiment of this application, the communication interface may be a transceiver, a pin, a circuit, a bus, a module, or another type of communication interface. This is not limited.

In an implementation, the communication apparatus is a communication device. When the communication apparatus is the communication device, the communication interface may be a transceiver or an input/output interface.

In another implementation, the communication apparatus is a chip configured in the communication device. When the communication apparatus is the chip configured in the communication device, the communication interface may be an input/output interface, and the processor may be a logic circuit.

Optionally, the transceiver may be a transceiver circuit. Optionally, the input/output interface may be an input/output circuit.

According to an eighth aspect, a processor is provided, including: an input circuit, an output circuit, and a processing circuit. The processing circuit is configured to: receive a signal by using the input circuit, and transmit a signal by using the output circuit, to enable the processor to perform the method in any one of the first aspect and the possible implementations of the first aspect.

In a specific implementation process, the processor may be one or more chips, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a trigger, various logic circuits, or the like. An input signal received by the input circuit may be received and input by, for example, but not limited to, a receiver, a signal output by the output circuit may be output to, for example, but not limited to, a transmitter and transmitted by the transmitter, and the input circuit and the output circuit may be a same circuit, where the circuit is used as the input circuit and the output circuit respectively at different moments. Specific implementations of the processor and the various circuits are not limited in this embodiment of this application.

According to a ninth aspect, a computer program product is provided. The computer program product includes a computer program (which may also be referred to as code or instructions). When the computer program runs, a computer is enabled to perform the method in any one of the first aspect and the possible implementations of the first aspect, or implement the method in any one of the second aspect and the possible implementations of the second aspect, or implement the method in any one of the third aspect and the possible implementations of the third aspect.

According to a tenth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program (which may also be referred to as code or instructions). When the computer program runs on a computer, the computer is enabled to perform the method in any one of the first aspect and the possible implementations of the first aspect, or implement the method in any one of the second aspect and the possible implementations of the second aspect, or implement the method in any one of the third aspect and the possible implementations of the third aspect.

According to an eleventh aspect, a communication system is provided, including the foregoing plurality of participant nodes and at least one central node.

For technical effects that can be achieved by any one of the second aspect to the eleventh aspect and any possible implementation of any one of the second aspect to the eleventh aspect, refer to descriptions of technical effects that can be achieved by the first aspect and the corresponding implementations of the first aspect. Details are not described herein again.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a communication system to which an embodiment of this application is applicable;

FIG. 2 is a schematic flowchart of a method for training an intelligent model according to an embodiment of this application;

FIG. 3 is another schematic flowchart of a method for training an intelligent model according to an embodiment of this application;

FIG. 4 is a diagram of sharing a transmission resource by a plurality of participant nodes according to an embodiment of this application;

FIG. 5 is a block diagram of an example of a communication apparatus according to this application; and

FIG. 6 is a diagram of a structure of an example of a communication device according to this application.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

In embodiments of this application, “/” may indicate an “or” relationship between associated objects. For example, A/B may indicate A or B. “And/or” may be used to describe three relationships between associated objects. For example, A and/or B may indicate the following three cases: Only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. To facilitate description of the technical solutions in embodiments of this application, in embodiments of this application, terms such as “first” and “second” may be used to distinguish between technical features having same or similar functions. The terms such as “first” and “second” do not limit a quantity and an execution sequence, and the terms such as “first” and “second” do not indicate a definite difference. In embodiments of this application, a term such as “example” or “for example” indicates an example, an illustration, or a description. Any embodiment or design scheme described as an “example” or “for example” should not be explained as being more preferred or having more advantages than another embodiment or design scheme. Use of the term such as “example” or “for example” is intended to present a relative concept in a specific manner for ease of understanding.

In embodiments of this application, “at least one (type)” may alternatively be described as “one (type) or more (types)”, and “a plurality of (types)” may be two (types), three (types), four (types), or more (types). This is not limited this application.

The following describes the technical solutions of this application with reference to the accompanying drawings.

The technical solutions in embodiments of this application may be applied to various communication systems, for example, a long term evolution (LTE) system, an LTE frequency division duplex (FDD) system, an LTE time division duplex (TDD) system, a 5th generation (5G) communication system, a future communication system (for example, a 6th generation (6G) communication system), or a system integrating a plurality of communication systems. This is not limited in embodiments of this application. 5G may also be referred to as new radio (new radio, NR).

FIG. 1 is a diagram of a communication system to which an embodiment of this application is applicable.

As shown in FIG. 1, a communication system to which an embodiment of this application is applicable may include at least one central node and a plurality of participant node groups. The central node and the plurality of participant node groups perform federated learning of an intelligent model. The intelligent model includes a plurality of feature models corresponding to a plurality of features of an inference target. Participant nodes in one of the participant node groups train one feature model. Training samples (or referred to as training data or sample data) used by participant nodes in a same participant node group to train a feature model are different, or training samples of participant nodes in a same participant node group belong to different sample spaces. Sample features of training samples used by participant nodes in different participant node groups to train feature models are different, or training samples of participant nodes in different participant node groups belong to different feature spaces. The central node and the participant nodes may decouple the intelligent model into the plurality of feature models based on different features according to a method for training an intelligent model provided in an embodiment of this application. Different participant node groups train different feature models based on corresponding feature sample data. Then, the central node aggregates gradient information fed back by the participant nodes after training, and updates parameters of the intelligent model. Model training with diversified features is implemented, and model performance can be improved.

The central node provided in embodiments of this application may be a network device, for example, a server or a base station. The central node may be a device that is deployed in a radio access network and that can directly or indirectly communicate with a participant node.

The participant node provided in embodiments of this application may be a device having a transceiver function, for example, a terminal or a terminal device. For example, the participant node may be a sensor or a device having a data collection function. The participant node may be deployed on land, including an indoor device, an outdoor device, a handheld device, and/or a vehicle-mounted device, may be deployed on a water surface (for example, on a ship), or may be deployed in air (for example, on an airplane, a balloon, or a satellite). The participant node may be user equipment (UE), and the UE includes a handheld device, a vehicle-mounted device, a wearable device, or a computing device that has a wireless communication function. For example, the UE may be a mobile phone, a tablet computer, or a computer having a wireless transceiver function. The terminal device may alternatively be a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in telemedicine, a wireless terminal in a smart grid, a wireless terminal in a smart city, a wireless terminal in a smart home, and/or the like.

The technical solutions provided in embodiments of this application may be applied to a plurality of scenarios, for example, smart retail, smart home, video surveillance, a vehicle network (such as autonomous driving and self-driving), and an industrial wireless sensor network (IWSN). However, this application is not limited thereto.

In an implementation, the technical solutions provided in this application may be applied to a smart home, to provide a personalized service for a customer based on a customer requirement. The central node may be a base station or a server, and the participant node may be a client device disposed in each home. Based on the technical solutions provided in this application, the client device provides the server only with gradient information that is obtained through model training based on local data and that will be synthesized through a router, so that training result information can be shared with the server when customer data privacy is protected. The server obtains aggregated gradient information of the synthesized gradient information provided by a plurality of client devices, determines updated model parameters, and notifies each client device of the updated model parameters, to continue training of the intelligent model. After completing the model training, the client device uses a trained model to provide a personalized service for a customer.

In another implementation, the technical solutions provided in this application may be applied to an industrial wireless sensor network, to implement industrial intelligence. The central node may be a server, and the participant nodes may be a plurality of sensors (for example, mobile intelligent robots) in a factory. After performing model training based on local data, the sensors send synthesized gradient information to the server, and the server determines updated model parameters based on the aggregated gradient information of the synthesized gradient information provided by the sensors, and notifies each sensor of the updated model parameters, and continues to train the intelligent model. After completing model training, the sensor uses a trained model to execute a task for the factory. For example, the sensor is a mobile intelligent robot, and may obtain a moving route based on the trained model, to complete a factory transportation task, an express package sorting task, and the like.

To better understand embodiments of this application, terms used in this specification are briefly described below.

1. Artificial Intelligence (AI)

The artificial intelligence AI enables a machine to have a learning capability and accumulate experience, so that the machine may resolve a problem, such as natural language understanding, image recognition, and/or chess playing, that can be resolved by humans through experience.

2. Neural network (NN): As an important branch of artificial intelligence, a neural network is a network structure that imitates a behavior feature of an animal neural network to perform information processing. A structure of the neural network includes a large quantity of nodes (or referred to as neurons) that are connected to each other. The neural network is based on a specific operation model, and processes information by learning and training input information. One neural network includes an input layer, a hidden layer, and an output layer. The input layer is responsible for receiving an input signal, the output layer is responsible for outputting a calculation result of the neural network, and the hidden layer is responsible for complex functions such as feature expression. A function of the hidden layer is represented by a weight matrix and a corresponding activation function.

A deep neural network (DNN) is generally of a multi-layer structure. Increasing a depth and/or a width of a neural network can improve an expression capability of the neural network, and provide more powerful information extraction and abstract modeling capabilities for a complex system. The depth of the neural network may be represented as a quantity of layers of the neural network. For one layer, a width of the neural network may be represented as a quantity of neurons included in the layer.

There may be a plurality of manners of constructing the DNN, for example, including but not limited to, a recurrent neural network (RNN), a convolutional neural network (CNN), and a fully-connected neural network.

3. Training or Learning

Training is a process of processing a model (or referred to as a training model). In the processing process, parameters in the model, for example, a weighted value, are optimized, so that the model learns to execute a specific task. Embodiments of this application are applicable to but are not limited to one or more of the following training methods: supervised learning, unsupervised learning, reinforcement learning, transfer learning, and the like. Supervised learning is training using a group of training samples that have been correctly labeled. Having been correctly labeled means that each sample has an expected output value. Unlike supervised learning, unsupervised learning is a method that automatically classifies or groups input data without giving a pre-labeled training sample.

4. Inference

Inference means performing data processing by using a trained model (the trained model may be referred to as an inference model). Actual data is input into the inference model for processing, to obtain a corresponding inference result. Inference may also be referred to as prediction or decision, and the inference result may also be referred to as a prediction result, a decision result, or the like.

5. Conventional Federated Learning

Federated learning is a distributed AI training method that a training process of an AI algorithm is performed on a plurality of devices instead of being aggregated to one server, so that problems of time consumption and a large quantity of communication overheads caused by data collection during centralized AI training can be resolved. In addition, because device data does not need to be sent to a server, privacy security problems can also be reduced. A specific process is as follows: A central node sends an AI model to a plurality of participant nodes, and the participant nodes perform AI model training based on data of the participant nodes, and report an AI model trained by the participant nodes to the central node in a gradient manner. The central node aggregates gradient information fed back by the plurality of participant nodes, to obtain parameters of a new AI model. The central node may send the updated parameters of the AI model to the plurality of participant nodes, and the participant nodes perform training on the AI model again. In different federated learning processes, participant nodes selected by the central node may be the same or may be different. This is not limited in this application.

In conventional federated learning, the trained model of the plurality of participant nodes is the same as a model of the server, and types of training data used by devices participating in model training are the same, and this may be referred to as a homogeneous network. For example, in training of an image recognition model, a plurality of image collection devices may train the model by using image data collected by the plurality of image collection devices. In this manner, diversity of training data can be improved, but diversity of features of an inference target is not considered. For example, when classification is performed based on both images and audio of animals, classification between cats and dogs can be more accurate. For another example, in the Internet of Vehicles, a camera, a positioning system, and an inertia measurement unit (IMU) are configured to collect data of different types (features), to estimate a position of a vehicle or distinguish a traffic condition in a road network, so that learning performance can be improved. In addition, for different feature data, training effects of different models are different. For example, an automatic encoder combined with a classification neural network is generally used to extract and classify features of an audio signal, and a convolutional neural network is generally used to process image data. In this application, it is considered that different types of participant nodes and the central node form the heterogeneous network (heterogeneous network) for federated learning, and the different types of participant nodes respectively train sub-models corresponding to different features of the inference target in the intelligent model, so that the performance of a trained intelligent model can be improved. However, there is an association relationship (or a coupling relationship) between the different features. To separately train different sub-models by the different participant node groups, this application proposes that the central node provides, for the different types of participant nodes, an inter-feature constraint variable that represents the association relationship between the features, so as to implement inter-feature decoupling of the models, so that the different types of participant nodes can train the different feature models based on the inter-feature constraint variable and local feature data. As model parameters are updated in a model training process, the different types of participant nodes calculate a gradient of the inter-feature constraint variable, and feed back the gradient to the central node, so that the central node can update the inter-feature constraint variable based on the gradients that are of the inter-feature constraint variable and that are fed back by the different types of participant nodes. This not only implements diversity of training data in federated learning, but also implements model training for different features. In this way, the performance of the trained model is improved.

The following describes, with reference to the accompanying drawings, a method for training an intelligent model provided in embodiments of this application.

FIG. 2 is a schematic flowchart of a method for training an intelligent model according to an embodiment of this application. A central node and a plurality of participant node groups jointly perform training of the intelligent model. The intelligent model includes a plurality of feature models corresponding to a plurality of features of an inference target. A participant node in one of the participant node groups may collect training data corresponding to one feature, and train, based on the training data corresponding to the feature, a feature model corresponding to the feature. The method for training an intelligent model shown in FIG. 2 is performed by a first participant node in the plurality of participant node groups. The first participant node belongs to a first participant node group, and a model trained by a participant node in the first participant node group is a first feature model.

For example, the intelligent model jointly trained by the central node and the plurality of participant node groups includes M feature models, the M feature models are respectively trained by M participant node groups, and participant nodes in one of the participant node groups train one feature model. The first participant node group is an m^thtype of participant node group in the M participant node groups, or a participant node group corresponding to a feature model m. In other words, the first feature model is the feature model m, or is referred to as an m^thtype of feature model. The first participant node may be a kth participant node in the first participant node group. In other words, the first participant node may be referred to as the kth participant node in the m^thtype of participant node group.

FIG. 2 is a schematic flowchart of a method for training an intelligent model according to an embodiment of this application. The method includes but is not limited to the following steps.

S201: The central node sends first information to the first participant node, where the first information includes an inter-feature constraint variable, and the inter-feature constraint variable represents a constraint relationship between different features.

Correspondingly, the first participant node receives the first information from the central node, and determines the inter-feature constraint variable based on the first information.

By way of example and not limitation, the first information is broadcast information. Each participant node in a participant node group can receive the first information, and determine the inter-feature constraint variable based on the first information.

S202: The first participant node obtains first gradient information based on the inter-feature constraint variable, a model parameter of the first feature model, and first sample data by using a gradient inference model, where the first gradient information is gradient information corresponding to the inter-feature constraint variable.

The central node sends third information to the participant node in the first participant node group, where the third information indicates an updated model parameter of the first feature model. The updated model parameter is obtained by the central node based on model gradient information that is obtained through model training and that is fed back by the participant node in the first participant node group. The first participant node updates the parameter of the first feature model based on the model parameter information, to obtain the first feature model whose parameter is updated. The first participant node may train, by using the model training method provided in FIG. 3, the first feature model whose parameter is updated. For details, refer to the following descriptions of the embodiment shown in FIG. 3.

After receiving the inter-feature constraint variable, the participant node may infer, based on the inter-feature constraint variable, the model parameter of the feature model, and local sample data by using the gradient inference model, the gradient information corresponding to the inter-feature constraint variable. In this way, the central node can obtain the gradient information that corresponds to the inter-feature constraint variable and that is fed back by a participant node in one or more groups of participant node groups, and the central node can update the inter-feature constraint variable based on the obtained gradient information corresponding to the inter-feature constraint variable.

The model parameter of the first feature model may be an updated model parameter indicated by the third information that is most recently received from the central node.

In an implementation, after receiving the inter-feature constraint variable, each participant node participating in model training infers the gradient information corresponding to the inter-feature constraint variable, and the central node updates the inter-feature constraint variable based on the gradient information that corresponds to the inter-feature constraint variable and that is fed back by each participant node.

In another implementation, because participant nodes in a same group train a same model, after receiving the inter-feature constraint variable, some participant nodes participating in model training may infer the gradient information corresponding to the inter-feature constraint variable, and the central node updates the inter-feature constraint variable based on the gradient information that corresponds to the inter-feature constraint variable and that is fed back by the part of participant nodes.

The first participant node may determine, in the following manner, whether to infer, based on the inter-feature constraint variable, the model parameter of the first feature model, and the local sample data, the gradient information corresponding to the inter-feature constraint variable.

Manner 1: The central node triggers a part or all of participant nodes in the plurality of participant node groups to infer (or calculate) the gradient information of the inter-feature constraint variable based on the inter-feature constraint variable, the model parameters, and the local sample data of the participant nodes.

In other words, the central node may select a part of all of the participant nodes participating in model training to infer the gradient information corresponding to the inter-feature constraint variable. Because feature models trained by participant nodes in a same participant node group are the same, the central node may select one or more participant nodes in each participant node group to infer the gradient information corresponding to the inter-feature constraint variable. However, this application is not limited thereto. Alternatively, the central node may select, based on a relationship between different features, one or more participant nodes in some participant node groups to infer the gradient information corresponding to the inter-feature constraint variable.

In an example, the central node may send a first identifier set, where the first identifier set includes an identifier of sample data of the inter-feature constraint variable selected by the central node.

After receiving the first identifier set, the first participant node determines, based on whether a sample data set of the first participant node includes sample data corresponding to an identifier in the first identifier set, whether to infer the gradient information corresponding to the inter-feature constraint variable.

If the sample data set of the first participant node includes the sample data corresponding to the identifier in the first identifier set, for example, the sample data set of the first participant node includes the first sample data corresponding to a first identifier in the first identifier set, the first participant node infers, based on the inter-feature constraint variable, the model parameter of the first feature model, and the first sample data, the gradient information corresponding to the inter-feature constraint variable.

If the sample data set of the first participant node does not include the sample data corresponding to the identifier in the first identifier set, the first participant node does not infer the gradient information corresponding to the inter-feature constraint variable.

The other participant nodes determine, in a same manner, whether to infer the gradient information corresponding to the inter-feature constraint variable.

In another example, the central node may send inference indication information to the participant nodes, to indicate a part or all of the participant nodes to infer the gradient information corresponding to the inter-feature constraint variable.

For example, the central node may send the inference indication information to a participant node that needs to infer the gradient information corresponding to the inter-feature constraint variable, and the participant node that receives the inference indication information infers the gradient information corresponding to the inter-feature constraint variable.

For another example, the central node may broadcast the inference indication information, where the inference indication information includes identifiers of one or more participant nodes, and a participant node corresponding to the identifier included in the inference indication information infers the gradient information corresponding to the inter-feature constraint variable.

Manner 2: A same sample data selector is configured for the central node and the participant node, and the central node and the participant node may determine, based on the sample data selector, a participant node that infers the gradient information corresponding to the inter-feature constraint variable.

For example, the sample data selector may generate at least one identifier of sample data, where the sample data corresponding to the at least one identifier is used to infer the gradient information corresponding to the inter-feature constraint in a current round. If the sample data set of the first participant node includes sample data corresponding to an identifier in the at least one identifier (for example, the first sample data corresponding to the first identifier), the first participant node infers the gradient information corresponding to the inter-feature constraint variable. If the sample data set of the first participant node does not include the sample data corresponding to the identifier in the at least one identifier, the first participant node does not infer the gradient information corresponding to the inter-feature constraint variable. The other participant nodes determine, in a same manner, whether to infer the gradient information corresponding to the inter-feature constraint variable.

If the first participant node determines to infer the gradient information corresponding to the inter-feature constraint variable, the first participant node infers, based on the inter-feature constraint variable, the model parameter of the first feature model, and the first sample data, the gradient information corresponding to the inter-feature constraint variable.

For example, if a sample data identifier set used to infer the gradient information corresponding to the inter-feature constraint variable in a current round (for example, in a t^thtime of inference) is I^t, and an identifier i of the first sample data of the first participant node belongs to I^t, that is, i∈I^t, the first participant node infers, based on the inter-feature constraint variable λ_i^tcorresponding to the first sample data (that is, sample data i), the model parameter θ_m^tof the first feature model, and the first sample data by using the gradient inference model, the gradient information corresponding to the inter-feature constraint variable λ_i^t, to obtain the first gradient information g_λ^t(i, m).

$g_{λ}^{t} (i, m) = \frac{1}{N} \nabla_{λ_{i}} f (θ_{m}^{t}, b^{t}, z_{i}^{t}, λ_{i}^{t})$

g_λ^t(i, m) represents the gradient information that is of the inter-feature constraint variable λ_i^tand that is obtained by a participant node that trains the feature model m (that is, the first feature model) through inference based on the sample data (that is, the first sample data) corresponding to the identifier i in the t^thtime of inference. θ_m^tis the updated model parameter obtained by the first participant node from the central node, b^tis an offset parameter of a t^thtime of training, and z_i^tis an auxiliary variable corresponding to an i^thpiece of training data in the t^thtime of training. The offset parameter b^tand the auxiliary variable z_i^tare from the central node. ∇_λ_if(⋅) is a function used to calculate the gradient information corresponding to the model parameter.

After obtaining the gradient information corresponding to the inter-feature constraint variable, the first participant node may send quantized first target gradient information to the central node. The quantized first target gradient information may be denoted as g_λ^t(i, m).

In an implementation, the first target gradient information is the foregoing first gradient information.

After the first participant node obtains the first gradient information, the first participant node performs quantization encoding on the first gradient information, to obtain the quantized first gradient information g_λ^t(i, m), and the first participant node sends the quantized first gradient information to the central node, so that the central node obtains an updated inter-feature constraint variable λ^t+1based on the gradient information that corresponds to the inter-feature constraint variable and that is fed back by the participant node.

In another implementation, the first target gradient information includes the first gradient information and first residual gradient information, and the first residual gradient information represents a residual amount that is of gradient information corresponding to the inter-feature constraint variable and that is not sent to the central node before the first participant nodes obtains the first gradient information.

The first target gradient information {tilde over (g)}_X^t(i, m) may be represented as:

${\tilde{g}}_{λ}^{t} (i, m) = g_{λ}^{t} (i, m) + β^{t} r_{λ}^{t} (i, m)$

- where β^t=τ^t−1/τ^t, τ^tis a model parameter update step in the t^thtime of model training, that is, a learning rate of the t^thtime of model training, τ^t−1is a model parameter update step in a (t−1)^thtime of model training, that is, a learning rate of the (t−1)^thtime of model training, and r_λ^t(i, m) is the first residual gradient information, that is, the residual amount that is in the gradient information corresponding to the inter-feature constraint variable and that is not sent to the central node before the gradient information corresponding to the inter-feature constraint variable is inferred for the t^thtime.

The first participant node sends the quantized first target gradient information to the central node, and the first participant node may update the residual gradient information. To be specific, the first participant node obtains second residual gradient information based on the first target gradient information {tilde over (g)}_λ^t(i, m) and the quantized first target gradient information {tilde over (g)}_λ^t(i, m), where the second residual gradient information is a residual amount r_λ^t+1(i, m) that is of the gradient information and that is not sent to the central node before the first participant node infers the gradient information corresponding to the inter-feature constraint variable for a (t+1)^thtime.

$r_{λ}^{t + 1} (i, m) = {\tilde{g}}_{λ}^{t} (i, m) - {\overline{g}}_{λ}^{t} (i, m)$

The second residual gradient information represents the residual amount that is in the first target gradient information and that is not sent to the central node. The second residual gradient information is used as the residual amount that is of the gradient information corresponding to the inter-feature constraint variable and that is not sent to the central node before a (t+1)^thtime of model training. In other words, the first participant node sends the quantized first target gradient information to the central node, and the residual amount of the gradient information is updated to the residual amount that is in the first target gradient information and that is not sent to the central node due to quantization encoding.

Optionally, the first participant node may determine, based on a scheduling policy, whether to send the quantized target gradient information to the central node.

For example, the first participant node sends the quantized first target gradient information to the central node based on the scheduling policy. After determining to send the quantized target gradient information to the central node, the first participant node sends the quantized first target gradient information g_θ^t(m, k) to the central node.

Optionally, if the first participant node determines, based on the scheduling policy, not to send the quantized target gradient information to the central node, the first participant node determines the third residual gradient information. The third residual gradient information is the first target gradient information. In this case, the third residual gradient information is the residual amount r_λ^t+1(i, m) that is of the gradient information and that is not sent to the central node before the first participant node performs the (t+1)^thtime of model training to obtain the gradient information:

$r_{λ}^{t + 1} (i, m) = {\tilde{g}}_{λ}^{t} (i, m)$

In other words, if the first participant node determines, based on the scheduling policy, not to send the first target gradient information to the central node, the residual amount r_λ^t+1(i, m) of the gradient information includes the first gradient information g_λ^t(i, m) obtained through the t^thtime of inference and the residual amount β^tr_λ^t(i, m) that is in the gradient information, corresponding to the inter-feature constraint variable, obtained before the t^thtime of inference and that is not sent to the central node.

In an example, the scheduling policy may be notified by the central node to the first participant node.

For example, the central node sends indication information A to the first participant node, where the indication information A may indicate the first participant node to send, to the central node after the t^thtime of inference of the gradient information corresponding to the inter-feature constraint variable, gradient information obtained through training. In this case, the first participant node sends, to the central node after the t^thtime of inference, the quantized first target gradient information, and calculates the residual amount (that is, the second residual gradient information) of the gradient information after the quantized first target gradient information is sent, where r_λ^t+1(i, m) is the second residual gradient information.

Alternatively, the indication information A may indicate the first participant node not to send the obtained gradient information to the central node after the t^thtime of model inference. In this case, the first participant node does not send the gradient information to the central node after obtaining the first gradient information through the t^thtime of inference, and accumulates the first gradient information to the residual gradient information to obtain the third residual gradient information, where r_λ^t+1(i, m) is the third residual gradient information.

In another example, the scheduling policy may be determined by the first participant node based on quantization noise information, channel resource information, and the first target gradient information.

A specific implementation in which the first participant node determines the scheduling policy based on the quantization noise information, the channel resource information, and target gradient information (for example, the first target gradient information in this example) is described in detail in Embodiment 2.

If the first participant node does not include the sample data corresponding to the identifier in the sample data identifier set I^t, the first participant node does not infer the gradient information of the inter-feature constraint variable. The first participant node updates the residual gradient information, to obtain r_λ^t+1(i, m)

$r_{λ}^{t + 1} (i, m) = β^{t} r_{λ}^{t} (i, m)$

S203: The central node receives the target gradient information that corresponds to the inter-feature constraint variable and that is from a plurality of participant nodes.

Using the first participant node in the plurality of participant nodes as an example, the first participant node may send the quantized first target gradient information obtained through inference to the central node. After receiving the quantized first target gradient information from the first participant node, the central node obtains the first target gradient information through quantization decoding. The central node updates, based on the received target gradient information that corresponds to the inter-feature constraint variable and that is respectively fed back by the plurality of participant nodes, the inter-feature constraint variable corresponding to each piece of sample data.

$λ_{i}^{t + 1} = {\begin{matrix} λ_{i}^{t} + τ^{t} \frac{N}{N_{b}} (\frac{1}{M} \sum_{m \in ℳ} {\overline{g}}_{λ}^{t} (i, m)), & i \in I^{t} \\ λ_{i}^{t}, & i \notin I^{t} \end{matrix}$

N^bis a quantity of pieces of the sample data, and the central node may select N^bpieces of sample data each time. A participant node that stores the selected sample data infers, based on the model parameter of the current feature model and the sample data, the gradient information of the inter-feature constraint variable, and feeds back the gradient information to the central node. Compared with a manner in which each participant node participates in feeding back the gradient information of the inter-feature constraint variable, resource overheads and implementation complexity can be reduced.

The central node further updates an offset parameter b^t+1based on the inter-feature constraint variable obtained through calculation in each round:

$b^{t + 1} = b^{t} - η^{t} g_{b}^{t}$

$where g_{b}^{t} = \nabla r (b^{t}) + \frac{1}{N_{b}} \sum_{i \in I^{t}} λ_{i}^{t} \cdot r (\cdot)$

is a block-separable regularization function.

In addition, the central node further updates an auxiliary variable z_i^t+1based on the inter-feature constraint variable obtained through update in each round:

$z_{i}^{t + 1} = {\begin{matrix} z_{i}^{t} - η^{t} g_{b}^{t} \frac{N}{N_{b}} g_{z_{i}}^{t}, & i \in I^{t} \\ z_{i}^{t}, & i \notin I^{t} \end{matrix}$

$where g_{z_{i}}^{t} = \frac{1}{N} (\nabla l_{i} (z_{i}^{t}) - λ_{i}^{t}),$

and l_i(⋅) is a sampling loss function of an i^thdata sample.

After obtaining the updated inter-feature constraint variable, the central node sends the updated offset parameter b^t+1, the auxiliary variable z_i^t+1, and the inter-feature constraint variable to the participant node, so that the participant node trains the feature model based on the offset parameter, the auxiliary variable, and the inter-feature constraint variable.

FIG. 3 is another schematic flowchart of a method for training an intelligent model according to an embodiment of this application. The method includes but is not limited to the following steps.

S301: A central node sends first information to a first participant node, where the first information includes an inter-feature constraint variable, and the inter-feature constraint variable represents a constraint relationship between different features.

Correspondingly, the first participant node receives the first information from the central node, and determines the inter-feature constraint variable based on the first information.

S302: The first participant node trains a first feature model based on the inter-feature constraint variable and model training data, to obtain second gradient information.

After obtaining the inter-feature constraint variable, the first participant node performs, based on the inter-feature constraint variable and the model training data, a t^thtime of model training on the first feature model.

The central node further sends third information to a participant node in the first participant node group, where the third information indicates an updated model parameter of the first feature model. The updated model parameter is obtained by the central node based on model gradient information that is obtained through a (t−1)^thtime of model training and that is fed back by the participant node in the first participant node group.

The first participant node updates the parameter of the first intelligent model based on the model parameter information, to obtain an updated first intelligent model. The first participant node then performs the t^thtime of model training, and trains, based on the inter-feature constraint variable and the model training data, the first intelligent model whose parameter is updated. The first participant node obtains the second gradient information after training the first feature model for the t^thtime.

For example, the second gradient information may be denoted as g_θ^t(m, k), and indicates gradient information obtained by a kth participant node (that is, the first participant node) in m types of participant node groups through the t^thtime of training. The second gradient information g_θ^t(m, k) may be represented as:

$g_{θ}^{t} (m, k) = \frac{1}{M} \sum_{i \in I_{m, k}^{t}} (\nabla_{θ_{m}} f (θ_{m}^{t}, b^{t}, z_{i}^{t}, λ_{i}^{t}))$

- where l_m,k^tis an index value set that is of training data and that is selected by the first participant node in the t^thtime of training. θ_m^tis the updated model parameter obtained by the first participant node from the central node, λ_i^tis the inter-feature constraint variable, b^tis an offset parameter of the t^thtime of training, and z_i^tis an auxiliary variable corresponding to an i^thpiece of training data in the t^thtime of training. The offset parameter b^tand the auxiliary variable z_i^tare obtained by the first participant node from the central node. ∇_θ_mf(⋅) is a function used to calculate gradient information corresponding to the model parameter. The second gradient information is used to update the model parameter of the first feature model. Specifically, the first participant node feeds back the gradient information to the central node, and the central node determines the updated model parameter based on the gradient information fed back by the participant node in the first participant node group.

After training the first feature model based on training data corresponding to each index value in the index value set of the training data, the first participant node obtains one piece of gradient information, and accumulates gradient information obtained by training the model based on each piece of training data. Because the first intelligent model is one of M feature models of the intelligent model, after the gradient information obtained through accumulation is divided by M, the second gradient information g_θ^t(m, k) obtained through the t^thtime of training of the first intelligent model by the first participant node is obtained. However, this application is not limited thereto.

After performing training on the first intelligent model for the t^thtime, the first participant node may send quantized second target gradient information to the central node. The quantized second target gradient information may be denoted as g_θ^t(m, k).

In an implementation, the second target gradient information is the foregoing second gradient information.

The first participant node obtains the second gradient information after performing the t^thtime of model training on the first intelligent model, and the first participant node performs quantization encoding on the second gradient information, to obtain the quantized second gradient information g₀^t(m, k), and sends the quantized second gradient information to the central node. In this way, the central node obtains the updated parameter of the first feature model based on the gradient information that is obtained through the t^thtime of training and that is fed back by the participant node in the first participant node group.

In another implementation, the second target gradient information includes the second gradient information and fourth residual gradient information, and the fourth residual gradient information represents a residual amount that is of gradient information and that is not sent to the central node before the first participant node obtains the second gradient information.

The second target gradient information {tilde over (g)}₀^t(m, k) may be represented as:

$r_{θ}^{t + 1} (m, k) = {\tilde{g}}_{θ}^{t} (m, k) - {\overline{g}}_{θ}^{t} (m, k)$

- where α^t=η^t−1/η^t, η^tis a model parameter update step in the t^thtime of model training, that is, a learning rate of the t^thtime of model training, η^t−1is a model parameter update step in a (t−1)^thtime of model training, that is, a learning rate of the (t−1)^thtime of model training, and r_θ^t(m, k) is the fourth residual gradient information, that is, the residual amount of gradient information that is in the gradient information obtained through model training before the t^thtime of model training and that is not sent to the central node.

After the first participant node sends the quantized second target gradient information to the central node, the first participant node may update residual gradient information. To be specific, the first participant node obtains, based on the second target gradient information {tilde over (g)}_θ^t(m, k) and the quantized second target gradient information g_θ^t(m, k), that fifth residual gradient information is a residual amount r_θ^t+1(m, k) that is of gradient information and that is not sent to the central node before the first participant node performs a (t+1)^thtime of model training to obtain gradient information.

${\tilde{g}}_{θ}^{t} (m, k) = g_{θ}^{t} (m, k) + α^{t} r_{θ}^{t} (m, k)$

The fifth residual gradient information represents the residual amount that is in the second target gradient information and that is not sent to the central node. The fifth residual gradient information is used as the residual amount of gradient information that is in the gradient information obtained through model training before the (t+1)^thtime of model training and that is not sent to the central node. In other words, the first participant node sends the quantized second target gradient information to the central node, and the residual amount of the gradient information is the residual amount that is in the second target gradient information and that is not sent to the central node due to quantization encoding.

Optionally, the first participant node may determine, based on a scheduling policy, whether to send the quantized target gradient information to the central node.

For example, the first participant node sends the quantized second target gradient information to the central node based on the scheduling policy. After determining to send the quantized target gradient information to the central node, the first participant node sends the quantized second target gradient information g_θ^t(m, k) to the central node.

Optionally, if the first participant node determines, based on the scheduling policy, not to send the quantized target gradient information to the central node, the first participant node determines sixth residual gradient information. The sixth residual gradient information is the second target gradient information. In this case, the sixth residual gradient information is the residual amount r_θ^t+1(m, k) that is of the gradient information and that is not sent to the central node before the first participant node performs the (t+1)^thtime of model training to obtain the gradient information:

$r_{θ}^{t + 1} (m, k) = {\tilde{g}}_{θ}^{t} (m, k)$

In other words, if the first participant node determines, based on the scheduling policy, not to send the second target gradient information to the central node, the residual amount r_θ^t+1(m, k) of the gradient information includes the second gradient information g_θ^t(m, k) obtained through the t^thtime of model training and the residual amount α^tr_θ^t(m, k) that is not sent to the central node before the second gradient information is obtained through the t^thtime of model training.

In an example, the scheduling policy may be notified by the central node to the first participant node.

For example, the central node sends indication information A to the first participant node, where the indication information A may indicate the first participant node to send, to the central node after the t^thtime of model training, the gradient information obtained through training. In this case, the first participant node sends the quantized second target gradient information to the central node after the t^thtime of training, and calculates the residual amount (that is, the fifth residual gradient information) of the gradient information after the quantized second target gradient information is sent, where r_θ^t+1(m, k) is the fifth residual gradient information.

Alternatively, the indication information A may indicate the first participant node not to send, to the central node after the t^thtime of model training, the gradient information obtained through training. In this case, the first participant node does not send the gradient information to the central node after obtaining the second gradient information through the t^thtime of training, and accumulates the second gradient information to the residual gradient information to obtain the sixth residual gradient information, where r_θ^t+1(m, k) is the sixth residual gradient information.

In another example, the scheduling policy may be determined by the first participant node based on quantization noise information, channel resource information, and the second target gradient information.

A specific implementation in which the first participant node determines the scheduling policy based on the quantization noise information, the channel resource information, and the target gradient information is described in detail in Embodiment 2.

The central node receives quantized target gradient information that is sent after the participant node in the first participant node group performs the t^thtime of model training,

$\begin{matrix} θ_{m}^{t + 1} = θ_{m}^{t} - η^{t} (\frac{1}{N^{b}} \sum_{k \in 𝒦_{m}} {\overline{g}}_{θ}^{t} (m, k)) & Embodiment 2 \end{matrix}$

$where 𝒦_{m} = {1, ..., K_{m}}, and N^{b} is a quantity of data samples .$

This embodiment of this application provides a manner in which a participant node determines a scheduling policy based on channel resource information and a to-be-transmitted signal {tilde over (g)}(m, k). The to-be-transmitted signal {tilde over (g)}(m, k) may be the foregoing second target gradient information {tilde over (g)}_θ^t(m, k), and the scheduling policy is used by the first participant node to determine whether to send the second target gradient information to a central node. The to-be-transmitted signal {tilde over (g)}(m, k) may be the foregoing first target gradient information {tilde over (g)}_λ^t(m, k), and the scheduling policy is used by the first participant node to determine whether to send the first target gradient information to the central node. However, this application is not limited thereto, and the scheduling policy may also be used in a decision of whether to transmit another to-be-transmitted signal.

By way of example and not limitation, the channel resource information includes channel state information and/or transmission time information.

The channel state information is state information h_kof a channel between the first participant node and the central node, and the transmission time information is duration T₀in which the first participant node occupies a channel resource to transmit gradient information.

The first participant node may determine quantization noise information based on the channel resource information and target information, where the quantization noise information represents a loss introduced by quantization encoding and decoding on the target information. For example, the target information may be the foregoing first target gradient information {tilde over (g)}_λ^t(m, k), or may be the foregoing second target gradient information {tilde over (g)}_θ^t(m, k). Alternatively, the target information may be other information. This is not limited in this application.

For example, as shown in FIG. 4, the first participant node performs quantization encoding on to-be-sent target information {tilde over (g)} by using a quantization encoding module, to obtain quantized target information g. The first participant node sends the quantized target information g to the central node by using a transceiver module of the first participant node. The quantized target information is received by the central node by using a transceiver module of the central node after channel propagation, and the quantized target information received by the central node may be denoted as g′. The central node performs quantization decoding on g′ by using a quantization decoder, to obtain recovered target information {tilde over (g)}′. A loss introduced by the recovered {tilde over (g)}′ relative to {tilde over (g)} is the quantization noise.

In an example, the first participant node may perform quantization encoding on the target information and then perform quantization decoding on the target information, to obtain quantization noise information, where the quantization noise information is a difference between the target information and a signal obtained through encoding and then decoding on the target information.

In another example, the first participant node may estimate the quantization noise information of the target information based on the channel resource information and the target information.

In other words, the first participant node may estimate, based on the obtained channel resource information, the loss of the target information after the quantized target information undergoes channel transmission and quantization decoding that is performed by the central node, to obtain the quantization noise information. Optionally, the channel resource information may include the channel state information and/or channel occupation time information (that is, the transmission time information).

Optionally, the first participant node obtains the quantization noise information of the target information based on the channel resource information, communication cost information, and the target information. The communication cost information indicates a communication cost weight of a communication resource, and the communication resource may include transmission power and/or transmission bandwidth.

Optionally, the central node may send second information to the first participant node, where the second information indicates the communication cost information. Correspondingly, the first participant node receives the second information from the central node, and determines the communication cost information based on the second information.

For example, the communication cost information may indicate a cost weight γ_pof the transmission power and a cost weight γ_Bof the transmission bandwidth. The first participant node obtains the transmission communication cost information based on the second information from the central node, and may obtain a parameter q_kthrough calculation based on the transmission power cost weight γ_p, the channel resource information, that is, the channel state information h_kand the transmission time information T₀, and a noise power spectrum density N₀. The parameter q_kmeets:

$q_{k} = \frac{T_{0} {❘ h_{k} ❘}^{2}}{N_{0} γ_{p}}$

The first participant node may further solve the following formula to obtain a parameter u_k*:

$u_{k}^{*} T_{0} \log (u_{k}^{*} q_{k}) - u_{k}^{*} T_{0} + \frac{T_{0}}{q_{k}} - γ_{B} = 0$

The first participant node obtains the quantization noise information based on the parameter u_k* and the target information {tilde over (g)}(m, k), where the quantization noise information is a covariance matrix of the quantization noise of the target information.

$Q_{k}^{*} = \frac{1}{2} ({(V_{k}^{2} + 2 u_{k}^{*} V_{k})}^{\frac{1}{2}} - V_{k})$

- where V_kis the covariance matrix of the target information {tilde over (g)}(m, k).

The first participant node may determine the transmission bandwidth B_k* based on the quantization noise information, the communication cost information, the channel resource information, and the target information, where B_k* meets:

$B_{k}^{*} = \frac{\log \det (1 + {V_{k} (Q_{k}^{*})}^{- 1})}{2 T_{0} \log (u_{k}^{*} q_{k})}$

- where I is an identity matrix, det(A) represents a determinant of a matrix A, and log(x) represents calculating a logarithm of x.

In addition, the first participant node may determine the transmission power p_k* based on the quantization noise information, the communication cost information, the channel resource information, and the target information, where p_k* meets:

$p_{k}^{*} = \frac{(u_{k}^{*} q_{k} - 1) \log \det (I + {V_{k} (Q_{k}^{*})}^{- 1})}{2 γ_{p} q_{k} \log (u_{k}^{*} q_{k})}$

After obtaining the transmission bandwidth B_k* and the transmission power p_k*, the first participant node may determine a threshold f_k* based on the transmission bandwidth B_k*, the transmission power p_k*, the quantization noise information Q_k* of the target information, and the communication cost, where the threshold f_k* meets:

$f_{k}^{*} = t r (Q_{k}^{*}) + γ_{p} p_{k}^{*} + γ_{B} B_{k}^{*}$

- where tr(A) represents a trace of the matrix A. In linear algebra, a sum of elements on a main diagonal (a diagonal from upper left corner to lower right) of an n×n matrix A is referred to as a trace (or a trace) of the matrix A, and is generally denoted as tr(A).

The first participant node may compare a metric value of the target information {tilde over (g)}(m, k) with the threshold, and determine whether to send the quantized target information to the central node.

By way of example and not limitation, the metric value of the target information may be a norm ∥{tilde over (g)}(m, k)∥₂of the target information.

If the target information is a vector, the norm ∥{tilde over (g)}(m, k)∥₂of the target information is an l₂norm of the target information; or if the target information is a matrix, the norm ∥{tilde over (g)}(m, k)∥₂of the target information is a Frobeius norm of the target information.

When the metric value of the target information is greater than the threshold f_k*, for example, ∥{tilde over (g)}_k∥2>f_k*, the first participant node sends the quantized target information to the central node, that is, the first participant node is in an active state; and otherwise, when the metric value of the target information is less than or equal to the threshold f_k*, for example, ∥{tilde over (g)}_k∥₂≤f_k*, the first participant node does not send the target information to the central node, that is, the first participant node is in an inactive state.

In the embodiments shown in FIG. 2 and FIG. 3, the participant node may determine, based on the scheduling policy provided in Embodiment 2, whether to send the quantized target information to the central node.

For example, in the example shown in FIG. 2, the first participant node determines the first threshold based on the first quantization noise information and the channel resource information, where the first quantization noise information represents the loss introduced by the quantization encoding and decoding on the first target gradient information For example, with reference to the foregoing descriptions, the first participant node may obtain the first quantization noise information based on the parameter u_k* and the first target information {tilde over (g)}_λ^t(i, m). If the metric value of the first target gradient information is greater than the first threshold, the first participant node sends the quantized first target gradient information to the central node; or if the metric value of the first target gradient information is less than or equal to the first threshold, the first participant node does not send the quantized first target gradient information to the central node.

For another example, in the example shown in FIG. 3, the first participant node determines the second threshold based on the second quantization noise information and the channel resource information, where the second quantization noise information represents the loss introduced by the quantization encoding and decoding on the second target gradient information. For example, with reference to the foregoing descriptions, the first participant node may obtain the second quantization noise information based on the parameter u_k* and the second target information {tilde over (g)}_θ^t(m, k). If the metric value of the second target gradient information is greater than the second threshold, the first participant node sends the quantized second target gradient information to the central node; or if the metric value of the second target gradient information is less than or equal to the second threshold, the first participant node does not send the quantized second target gradient information to the central node.

In the examples of this application, unless otherwise stated or there is a logic conflict, terms and/or descriptions in different examples are consistent and may be mutually referenced, and technical features in different examples may be combined based on an internal logical relationship thereof, to form a new example.

In this application, a node involved may perform some or all steps or operations related to the node. These steps or operations are merely examples. In this application, other operations or variants of various operations may be further performed. In addition, the steps may be performed in a sequence different from a sequence presented in this application, and it is possible that not all the operations in this application need to be performed.

The methods provided in embodiments of this application are described above in detail with reference to FIG. 2 to FIG. 4. The following describes in detail apparatuses provided in embodiments of this application. To implement the functions in the foregoing methods provided in embodiments of this application, each network element may include a hardware structure and/or a software module, and implement the foregoing functions in a form of the hardware structure, the software module, or a combination of the hardware structure and the software module. Whether a function in the foregoing functions is performed in the manner of the hardware structure, the software module, or the combination of the hardware structure and the software module depends on particular applications and design constraint conditions of the technical solutions.

FIG. 5 is a block diagram of an apparatus for training an intelligent model according to an embodiment of this application. As shown in FIG. 5, the apparatus 500 for training an intelligent model may include a processing unit 510 and a transceiver unit 520.

In a possible design, the apparatus 500 for training an intelligent model may correspond to the participant node in the foregoing method embodiments, or may be configured on (or used in) a chip in the participant node, or may be another apparatus, module, circuit, unit, or the like that can implement a method performed by the participant node.

It should be understood that, the apparatus 500 for training an intelligent model may correspond to the participant node in the methods in embodiments of this application, and the apparatus 500 for training an intelligent model may include units in a first device configured to perform the methods shown in FIG. 2 and FIG. 3. In addition, the units in the apparatus 500 for training an intelligent model and the foregoing other operations and/or functions are respectively used to implement corresponding procedures of the methods shown in FIG. 2 and FIG. 3.

When the apparatus 500 for training an intelligent model is configured to implement the corresponding procedures performed by the participant node in the foregoing method embodiments, the transceiver unit 520 is configured to receive first information from the central node, where the first information indicates an inter-feature constraint variable, and the inter-feature constraint variable represents a constraint relationship between different features. The processing unit 510 is configured to obtain first gradient information based on the inter-feature constraint variable, a model parameter of the first feature model, and first sample data by using a gradient inference model, where the first gradient information is gradient information corresponding to the inter-feature constraint variable. The processing unit 510 is further configured to send the first gradient information to the central node.

Optionally, the processing unit 510 is further configured to determine a threshold based on quantization noise information and channel resource information, where the quantization noise information represents a loss introduced by quantization encoding and decoding on target information. The transceiver unit 520 is further configured to: if a metric value of the target information is greater than the threshold, send quantized target information; or if the metric value of the target information is less than or equal to the threshold, skip sending the quantized target information.

It should be further understood that, when the apparatus 500 for training an intelligent model is the chip configured on (or used in) the participant node, the transceiver unit 520 in the apparatus 500 for training an intelligent model may be an input/output interface or a circuit of the chip, and the processing unit 510 of the apparatus 500 for training an intelligent model may be a logic circuit in the chip.

In another possible design, the apparatus 500 for training an intelligent model may correspond to the central node in the foregoing method embodiments, for example, a chip configured on (or used in) the central node, or another apparatus, module, circuit, or unit that can implement a method performed by the central node.

It should be understood that, the apparatus 500 for training an intelligent model may correspond to the central node in the methods shown in FIG. 2 and FIG. 3, and the apparatus 500 for training an intelligent model may include units of the central node configured to perform the methods shown in FIG. 2 and FIG. 3. In addition, the units in the apparatus 500 for training an intelligent model and the foregoing other operations and/or functions are respectively used to implement corresponding procedures of the methods shown in FIG. 2 and FIG. 3.

When the apparatus 500 for training an intelligent model is configured to implement the corresponding procedures performed by the central node in the foregoing method embodiments, the processing unit 510 is configured to determine an inter-feature constraint variable, where the inter-feature constraint variable represents a constraint relationship between different features; and the transceiver unit 520 is configured to send first information to a participant node in the plurality of participant node groups, where the first information includes the inter-feature constraint variable.

It should be further understood that, when the apparatus 500 for training an intelligent model is the chip configured on (or used in) the central node, the transceiver unit 520 in the apparatus 500 for training an intelligent model may be an input/output interface or a circuit of the chip, and the processing unit 510 of the apparatus 500 for training an intelligent model may be a logic circuit in the chip. Optionally, the apparatus 500 for training an intelligent model may further include a storage unit 530. The storage unit 530 may be configured to store instructions or data. The processing unit 510 may execute the instructions or the data stored in the storage unit, so that the apparatus for training an intelligent model implements a corresponding operation.

It should be understood that, the transceiver unit 520 in the apparatus 500 for training an intelligent model may be implemented by using a communication interface (for example, a transceiver or an input/output interface), for example, may correspond to a transceiver 610 in a communication device 600 shown in FIG. 6. The processing unit 510 in the apparatus 500 for training an intelligent model may be implemented by using at least one processor, for example, may correspond to a processor 620 in the communication device 600 shown in FIG. 6. The processing unit 510 in the apparatus 500 for training an intelligent model may be further implemented by using at least one logic circuit. The storage unit 530 in the apparatus 500 for training an intelligent model may correspond to a memory in the communication device 600 shown in FIG. 6.

It should be further understood that, a specific process in which the units perform the foregoing corresponding steps is described in detail in the foregoing method embodiments, and for brevity, details are not described herein again.

FIG. 6 is a diagram of a structure of a terminal device 600 according to an embodiment of this application.

The communication device 600 may correspond to the participant node in the foregoing method embodiments. As shown in FIG. 6, the participant node 600 includes a processor 620 and a transceiver 610. Optionally, the participant node 600 further includes a memory. The processor 620, the transceiver 610, and the memory may communicate with each other through an internal connection path, to transmit a control signal and/or a data signal. The memory is configured to store a computer program, and the processor 620 is configured to execute the computer program in the memory, to control the transceiver 610 to receive and send a signal.

It should be understood that, the communication device 600 shown in FIG. 6 can implement the processes related to the participant node in the method embodiments shown in FIG. 2 and FIG. 3. The operations and/or the functions of the modules in the participant node 600 are respectively used for implementing the corresponding procedures in the foregoing method embodiments. For details, refer to the descriptions in the foregoing method embodiments. To avoid repetition, detailed descriptions are appropriately omitted herein.

The communication device 600 may be corresponding to the central node in the foregoing method embodiment. As shown in FIG. 6, the central node 600 includes a processor 620 and a transceiver 610. Optionally, the central node 600 further includes a memory. The processor 620, the transceiver 610, and the memory may communicate with each other through an internal connection path, to transmit a control signal and/or a data signal. The memory is configured to store a computer program, and the processor 620 is configured to execute the computer program in the memory, to control the transceiver 610 to receive and send a signal.

It should be understood that, the communication device 600 shown in FIG. 6 can implement the processes related to the central node in the method embodiments shown in FIG. 2 and FIG. 3. The operations and/or the functions of the modules in the central node 600 are respectively used for implementing the corresponding procedures in the foregoing method embodiments. For details, refer to the descriptions in the foregoing method embodiments. To avoid repetition, detailed descriptions are appropriately omitted herein.

The processor 620 and the memory may be integrated into one processing apparatus. The processor 620 is configured to execute program code stored in the memory to implement the foregoing functions. During specific implementation, the memory may alternatively be integrated into the processor 620, or may be independent of the processor 620. The processor 620 may correspond to the processing unit in FIG. 5.

The transceiver 610 may correspond to the transceiver unit in FIG. 5. The transceiver 610 may include a receiver (or referred to as a receiver machine or a receiver circuit) and a transmitter (or referred to as a transmitter machine or a transmitter circuit). The receiver is configured to receive a signal, and the transmitter is configured to transmit a signal.

It should be understood that, the communication device 600 shown in FIG. 6 can implement the processes related to the terminal device in the method embodiments shown in FIG. 2 and FIG. 3. The operations and/or the functions of the modules in the terminal device 600 are respectively used for implementing the corresponding procedures in the foregoing method embodiments. For details, refer to the descriptions in the foregoing method embodiments. To avoid repetition, detailed descriptions are appropriately omitted herein.

An embodiment of this application further provides a processing apparatus, including a processor and a (communication) interface. The processor is configured to perform the method in any one of the foregoing method embodiments.

It should be understood that, the processing apparatus may be one or more chips. For example, the processing apparatus may be a field programmable gate array (FPGA), an application-specific integrated chip (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processing circuit (DSP), a microcontroller (MCU), a programmable controller (PLD), or another integrated chip.

Based on the methods provided in embodiments of this application, this application further provides a computer program product. The computer program product includes computer program code. When the computer program code is executed by one or more processors, an apparatus including the processor is enabled to perform the methods in the embodiments shown in FIG. 2 and FIG. 3.

All or some of the technical solutions provided in embodiments of this application may be implemented through software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or some of the embodiments may be implemented in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to embodiments of the present invention are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, a network device, a terminal device, a core network device, a machine learning device, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a data subscriber line (digital subscriber line, DSL)) or wireless (for example, infrared, wireless, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage device, for example, a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (digital video disc, DVD)), a semiconductor medium, or the like.

Based on the methods provided in embodiments of this application, this application further provides a computer-readable storage medium. The computer-readable storage medium stores program code. When the program code is run by one or more processors, an apparatus including the processor is enabled to perform the methods in the embodiments shown in FIG. 2 and FIG. 3.

Based on the methods provided in embodiments of this application, this application further provides a system, including the foregoing one or more first devices. The system may further include the foregoing one or more second devices.

Optionally, the first device may be a network device or a terminal device, and the second device may be a device that communicates with the first device through a radio link.

In several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in another manner. For example, the foregoing described apparatus embodiments are merely examples. For example, division into the units is merely logic function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of embodiments.

The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

	Number	Date	Country
Parent	PCT/CN2022/140797	Dec 2022	WO
Child	18750688		US

METHOD AND APPARATUS FOR TRAINING INTELLIGENT MODEL

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

Priority Claims (1)

CROSS-REFERENCE TO RELATED APPLICATIONS

Continuations (1)