The invention relates to the field of training deep neural networks on data sets, in particular to an energy-efficient sample selection method based on complexity.
With the resurgence of deep neural network architectures and the improvement of GPU computing power, deep neural networks have shown remarkable performance in many computer vision tasks. However, training deep neural networks on large-scale datasets is inefficient, for the following reasons: first, network architectures keep growing deeper, and a whole network can contain millions of parameters or more, and this explosive growth in model scale makes training the neural network difficult; second, training a deep neural network requires a large number of labeled data samples to update the model weights. As a result, training deep neural networks on large-scale datasets is inefficient, and the training process demands substantial computing power.
The technical problem to be solved by the invention is that training deep neural networks on large-scale data sets consumes substantial computing power and energy, and the training efficiency is low.
In order to solve the above technical problems, the invention provides a technical scheme: an energy-efficient sample selection method based on complexity, which performs sample selection on the raw data sets through two stages, inter-class sampling and intra-class sampling, to achieve the object of constructing a lightweight data set for model training. Wherein:
Compared with the prior art, the invention has the following advantages. The invention proposes an energy-efficient sample selection method based on complexity that selects representative samples from large-scale datasets for efficient model training, and shows that sample complexity and the model training strategy have a very important impact on the efficient training of deep neural networks. The invention solves the problem of low model training efficiency by exploiting sample complexity and the model training strategy: the object of the energy-efficient sample selection method based on complexity is to select representative samples from large-scale data sets, thereby reducing the number of samples used for model training and achieving the object of lightweight training.
In inter-class sampling, all samples in the inverse diverse self-paced learning data set D = {(x_i, c_i)} can be quantified by the loss value loss_i given by the model pre-trained by inverse self-paced learning, denoted as D_s = {(x_i, y_i, loss_i)}_{i=1}^{k}, wherein y_i ∈ C is the label information and loss_i is the training loss of the sample x_i.
In intra-class sampling, samples within each class are iteratively selected based on the density distribution of the samples. The sampling rate ζ refers to the proportion of samples selected from each class. In each iteration, density-based clustering is performed to connect regions of the sample set into clusters and to exclude noise samples that do not belong to any cluster. Considering that there may be significant differences in the loss distributions of the clusters, the invention uses the mean shift algorithm to automatically find the number of clusters cNum and the cluster centers cCenters, and uses a size threshold to set the sampling strategy: when the number of samples ∥cSample(j)∥ in cluster j is greater than the threshold, the cluster is dense and contains many samples, and, in order to reduce the samples used for model training, a density-based Monte Carlo sampling algorithm is used to select representative samples from the cluster; for a cluster with fewer samples, all samples in the cluster are directly added to Ψ.
Further, the Monte Carlo sampling algorithm is described in detail in the embodiment below.
In the embodiment of the invention, the invention provides a technical scheme: an energy-efficient sample selection method based on complexity, which performs sample selection on the raw data sets through two stages, inter-class sampling and intra-class sampling, to achieve the object of constructing a lightweight data set for model training. Wherein:
The energy-efficient sample selection method based on complexity is a framework comprising the two stages of inter-class sampling and intra-class sampling. First, inter-class sampling uses the inverse self-paced learning method to strengthen the learning of the difficult samples of each class and realizes adaptive adjustment of the learning weights of difficult samples. Then, intra-class sampling retains the difficult samples in each class and downsamples the easy samples through a clustering-based sampling algorithm driven by the difficulty of the data samples. Finally, a lightweight data set is obtained.
In inter-class sampling, all samples in the inverse diverse self-paced learning data set D = {(x_i, c_i)} can be quantified by the loss value loss_i given by the model pre-trained by inverse self-paced learning, denoted as D_s = {(x_i, y_i, loss_i)}_{i=1}^{k}, wherein y_i ∈ C is the label information and loss_i is the training loss of the sample x_i.
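As a minimal sketch of this quantification step, assuming a PyTorch classifier, a loader that yields (x_i, y_i) mini-batches, and cross-entropy as the training loss (the patent does not fix the loss function, and the helper name quantify_by_loss is likewise illustrative):

```python
import torch
import torch.nn.functional as F

def quantify_by_loss(model, loader, device="cpu"):
    """Build D_s = {(x_i, y_i, loss_i)} from a pre-trained model.

    `model` stands in for the network pre-trained with inverse
    self-paced learning; cross-entropy is an assumed loss function.
    """
    model.eval()
    d_s = []  # triples (x_i, y_i, loss_i)
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device))
            # per-sample loss rather than the batch mean
            losses = F.cross_entropy(logits, y.to(device), reduction="none")
            d_s.extend(zip(x, y.tolist(), losses.cpu().tolist()))
    return d_s
```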
In intra-class sampling, samples within each class are iteratively selected based on the density distribution of the samples. The sampling rate ζ refers to the proportion of samples selected from each class. In each iteration, density-based clustering is performed to connect regions of the sample set into clusters and to exclude noise samples that do not belong to any cluster. Considering that there may be significant differences in the loss distributions of the clusters, the invention uses the mean shift algorithm to automatically find the number of clusters cNum and the cluster centers cCenters, and uses a size threshold to set the sampling strategy: when the number of samples ∥cSample(j)∥ in cluster j is greater than the threshold, the cluster is dense and contains many samples, and, in order to reduce the samples used for model training, a density-based Monte Carlo sampling algorithm is used to select representative samples from the cluster; for a cluster with fewer samples, all samples in the cluster are directly added to Ψ.
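A minimal sketch of one intra-class iteration, using scikit-learn's MeanShift for the clustering step; the feature matrix, the size threshold, and the helper names are illustrative assumptions, and mc_sample is the Monte Carlo routine sketched after Algorithm 4 below:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def intra_class_sample(features, threshold, zeta, mc_sample):
    """Cluster one class with mean shift and apply the size-based strategy.

    `features`: (n, d) array describing the class samples (e.g. their loss
    values as a column vector); `threshold`: cluster-size threshold; `zeta`:
    sampling rate; `mc_sample`: the Monte Carlo routine sketched below.
    """
    bandwidth = estimate_bandwidth(features)
    # cluster_all=False leaves orphan points labelled -1, mirroring the
    # exclusion of noise samples that belong to no cluster
    ms = MeanShift(bandwidth=bandwidth, cluster_all=False).fit(features)
    c_num = len(ms.cluster_centers_)  # cNum, found automatically
    selected = []
    for j in range(c_num):
        members = features[ms.labels_ == j]
        if len(members) > threshold:
            # dense cluster: keep only representative samples
            selected.extend(mc_sample(ms.cluster_centers_[j], members, zeta))
        else:
            # sparse cluster: keep every sample
            selected.extend(members)
    return np.asarray(selected)
```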
The pseudo code for Monte Carlo sampling is shown in Algorithm 4. The input parameters of Algorithm 4 comprise the cluster center Center, the loss distribution of the given cluster samples Samples, and the sampling rate ζ that controls how many samples are selected from each cluster. In Algorithm 4, our object is to gather the samples selected from each cluster into a selected sample data set R. First, we decide how many samples should be selected from a cluster by sampleNumber = |Samples| × ζ. To perform the sampling, we initialize x^(0) with the cluster center instead of a random draw from the prior distribution, to avoid getting stuck in local optima. In the main loop of Algorithm 4, a candidate solution x_cand is first generated from the proposal distribution q(x^(i) | x^(i-1)), and the acceptance probability of x_cand given the previous state x^(i-1) is denoted α(x_cand | x^(i-1)); α is then calculated from the proposal distribution and the joint probability density w(·). Finally, the acceptance probability α is compared with a draw u from the continuous uniform distribution over the interval [0, 1]. If u < α, the candidate solution is accepted and added to R as a selected sample. This process is repeated until the size of R reaches the sampling threshold sampleNumber, and the algorithm stops.
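A minimal sketch of Algorithm 4 in this Metropolis-Hastings style; the Gaussian proposal q and the Gaussian-kernel form of the density w(·) are assumptions, since the patent does not specify them:

```python
import numpy as np

def mc_sample(center, samples, zeta, rng=None):
    """Select representative points from one cluster by Monte Carlo sampling.

    The chain is initialized at the cluster center rather than a random
    draw; candidates come from a Gaussian proposal q(x | x_prev), and a
    candidate is accepted when u < alpha for u ~ U[0, 1].
    """
    rng = rng or np.random.default_rng()
    samples = np.asarray(samples, dtype=float)
    sample_number = int(len(samples) * zeta)  # sampleNumber = |Samples| * zeta
    sigma = samples.std() + 1e-8              # assumed proposal/kernel width

    def w(x):
        # unnormalized cluster density at x (Gaussian kernel estimate)
        return np.exp(-((samples - x) ** 2).sum(axis=-1) / (2 * sigma ** 2)).sum()

    R = []
    x_prev = np.asarray(center, dtype=float)
    while len(R) < sample_number:
        x_cand = rng.normal(x_prev, sigma)       # draw from q(x | x_prev)
        alpha = min(1.0, w(x_cand) / w(x_prev))  # acceptance probability
        if rng.uniform(0.0, 1.0) < alpha:
            R.append(x_cand)                     # accepted state joins R
            x_prev = x_cand
    return R
```

Starting the chain at the cluster center keeps it in the high-density region from the first step, which is the stated motivation for not initializing from a random prior draw.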
The basic principles, main features and advantages of the invention are shown and described above. Those skilled in the art should understand that the invention is not limited by the above embodiments; the descriptions in the above embodiments and the specification only illustrate the principle of the invention. Without departing from the spirit and scope of the invention, the invention may undergo various changes and improvements, and these changes and improvements all fall within the protection scope of the invention. The protection scope of the invention is defined by the appended claims and their equivalents.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202210727797.X | Jun 2022 | CN | national |