The importance of a good feature embedding in few-shot learning is well studied. A feature embedding is pre-trained as a classification task on the meta-training dataset (base classes). Fine-tuning on the meta-test dataset (novel classes) is known to surpass most meta-learning methods. However, fine-tuning only a classifier on the meta-test dataset leaves the feature embedding itself unchanged.
A pre-trained feature extractor yields well-defined feature distributions on base classes, but this is not true for novel classes. Novel classes may come from a variety of domains different from the base classes. Globally, the initial feature distributions of novel classes are affected mainly by the domain difference. Locally, features are not trained to cluster tightly within a class and separate well between classes, which intensifies the bias of estimates made from only a few samples. These biases in novel-class feature distributions show the importance of refining novel-class features.
Utilizing a deep network trained on a meta-training dataset serves as a strong baseline in few-shot learning. In the disclosed invention, novel-class features are refined by fine-tuning the feature extractor on the meta-test set using only a few samples. Biases in novel-class feature distributions are reduced by dividing them into two aspects: class-agnostic biases and class-specific biases.
Class-agnostic bias refers to the feature distribution shift caused by domain differences between novel and base classes. The unrefined features from novel classes could cluster in some primary direction due to the domain difference, which leads to a skewed feature distribution, as shown in the left graph in
Class-specific bias refers to the biased estimation produced by using only a few samples in one class. Biased estimation is a persistent problem in few-shot learning. With only a few samples, the estimate of the feature distribution within one class is biased, as shown in the left portion of
The disclosed invention provides a Distribution Calibration Module (DCM) to reduce class-agnostic bias. DCM is designed to eliminate domain difference by normalizing the overall feature distribution for novel classes and further reshaping the feature manifold for fast adaptation during fine-tuning. The results are shown in the right graph of
For class-specific bias, the disclosed invention implements Selected Sampling (SS) to augment more data for better estimation. More specifically, SS operates on a chain of proposal distributions centered on each data point from the support set. The whole sampling process is guided by an acceptance criterion wherein only samples beneficial to the optimization are selected. The results of applying SS are shown in the right portion of
By way of example, a specific exemplary embodiment of the disclosed system and method will now be described, with reference to the accompanying drawings, in which:
To explain the invention, the few-shot classification setting is formalized with notation. Let (x, y) denote an image with its ground-truth label. In few-shot learning, the training dataset and testing dataset are referred to as the support and query set respectively and are collectively called a C-way K-shot episode. The training (support) dataset is denoted as s={(xi, yi)}, i=1, . . . , N.
For supervised learning, a statistic θ*=θ*(s) is learned to classify s as measured by the cross-entropy loss:
More specifically:
The novel-class prototype wc, c∈C, is the mean feature from the support set s:
Herein, fθ(x) is first pre-trained with the meta-training dataset using cross-entropy loss and further, in each testing episode, θ*=θ*(s) is learned by finetuning fθ(x) using s. Given a test datum x where (x, y)∈q, y is predicted:
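The prototype computation and prediction steps above can be sketched as follows. This is a minimal sketch: the patent does not specify the classifier head at this point, so cosine similarity to the class prototypes is an assumed choice, and the function names are hypothetical.

```python
import numpy as np

def class_prototypes(feats, labels, num_classes):
    # w_c: mean support-set feature for each class c (the class prototype)
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

def predict(query, protos):
    # Predict y for a query feature by cosine similarity to each prototype
    # (an assumed head; the exact similarity is not fixed by this passage).
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return (q @ p.T).argmax(axis=1)
```

In a 2-way episode, `class_prototypes` averages the support features per class, and `predict` assigns each query to its nearest prototype.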
With this basic finetuning framework, the Distribution Calibration Module and Selected Sampling will now be discussed.
Distribution Calibration Module—The disclosed invention inserts a plug-in distribution calibration module (DCM) following a feature extractor to reduce class-agnostic bias caused by domain differences.
The first step in reducing class-agnostic bias is to calibrate the skewed feature distribution. A pre-trained feature extractor fθ(x) provides an initial feature space wherein general invariant features are learned from a large-scale dataset. θ*=θ*(base) is sufficient to classify the base classes but may be inadequate for distinguishing novel classes. The overall feature distribution of novel classes may be skewed due to the domain properties of the novel data. The feature distribution can be described statistically:
For feature distributions that are skewed in some directions, μ and σ can be far from those of a normal distribution. They are first applied to calibrate the distribution to approach a zero-centered mean and unit standard deviation:
This distribution calibration by feature normalization helps correct the skewed directions brought by large domain differences between base and novel classes.
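The calibration step above can be sketched as a simple normalization over the novel-class support features. This is a sketch under the assumption that μ and σ are estimated per feature dimension over all novel-class support features; the `eps` term is an added numerical-stability guard, not part of the patent text.

```python
import numpy as np

def calibrate(feats, eps=1e-6):
    # Estimate mu and sigma per dimension over all novel-class support
    # features, then normalize toward zero mean and unit standard deviation.
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)
```

After calibration, the per-dimension mean is zero and the per-dimension standard deviation is approximately one, correcting the skewed directions described above.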
Fast feature adaptation is enabled during fine-tuning. For a feature vector, there are locations encoding class-sensitive information and locations encoding common information. Values on class-sensitive locations are expected to vary between classes to distinguish them. Similar values are obtained from common locations among all samples, representing some domain information but contributing little to classification. By this normalization, those locations that encode class-sensitive features stand out compared with the locations encoding common information.
Additionally, the calibrated feature embedding is multiplied element-wise by a scale vector:
fi=fθ(xi); and
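The full module, normalization followed by the element-wise scale vector s, can be sketched as below. This is a minimal sketch: the class name mirrors the module's name in the text, s is shown at its initialization (in the disclosed method it would be updated by gradient descent during fine-tuning), and the `eps` guard is an added assumption.

```python
import numpy as np

class DCM:
    # Plug-in Distribution Calibration Module: normalize the novel-class
    # feature distribution, then multiply element-wise by a scale vector s.
    def __init__(self, dim, eps=1e-6):
        self.s = np.ones(dim)  # learnable scale vector (updated by fine-tuning)
        self.eps = eps

    def __call__(self, feats):
        mu = feats.mean(axis=0)
        sigma = feats.std(axis=0)
        return self.s * (feats - mu) / (sigma + self.eps)
```

Because s multiplies each feature location independently, gradient updates to s can amplify class-sensitive locations relative to common ones, as discussed next.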
This is illustrated in
As s is multiplied element-wise with fi, only the partial derivative at a single location on the feature vector is shown. In the 1-shot case with mean features as class prototypes, the gradient is specified by:
After applying the distribution calibration to fi and fj, the gradients of s at the class-sensitive locations have relatively larger values than at the common ones. The differences between features are correspondingly enlarged through gradient-based optimization, and the feature manifold quickly adapts to a shape in which the distinguishing parts are amplified.
Selected Sampling—Selected sampling is used to reduce class-specific bias. Biased estimation within a class inevitably hinders the optimization of features. The gradient for the feature f of (x, y) during finetuning is:
As p(y|x)≤1, gradient descent moves f toward the direction of wy, its ground-truth class prototype. For a class c, the mean feature from the support set is used as the class prototype when computing the predicted probability:
This is the empirical estimate of the mean using the support set. The true mean of the class distribution is denoted mc. The bias term δc between the empirical estimate and the true value is defined as:
δc=wc−mc (9)
For few-shot learning, as wc is estimated from a small number of samples, δc is not negligible. As defined in Eq. (9), wy can be replaced by δy+my. Then, the gradient of the feature f is given by:
The optimization of f toward its class prototype wy can be factorized into two parts: one part, (p(y|x)−1)δy, dominated by the bias, and the other part, (p(y|x)−1)my, dominated by the true mean. Ideally, features are expected to cluster tightly around m for a refined feature distribution. However, the term (p(y|x)−1)δy in the gradient distracts the optimization of f by moving it toward the bias, which hinders its approach to the true mean. This inevitably impedes the optimization for few-shot learning.
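The factorization described above follows directly from substituting Eq. (9) into the gradient term involving wy; a reconstruction of that step, consistent with the surrounding notation, is:

```latex
\big(p(y\mid x)-1\big)\,w_y
  \;=\; \big(p(y\mid x)-1\big)\big(\delta_y + m_y\big)
  \;=\; \underbrace{\big(p(y\mid x)-1\big)\,\delta_y}_{\text{bias-driven}}
  \;+\; \underbrace{\big(p(y\mid x)-1\big)\,m_y}_{\text{true-mean-driven}}
```

Since p(y|x)−1≤0, descent along the second term moves f toward the true mean my, while descent along the first term moves f toward the bias δy.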
As shown in
Augmenting more data is an efficient way to reduce bias. If more features could be sampled and added to the set used to compute the class prototype within one class, the effects caused by the bias would be vastly reduced. However, the feature distribution of each class is unknown, which precludes direct sampling from that distribution.
Without inferring the actual feature distribution, selected sampling guides a Monte-Carlo sampling under a proposal distribution. By taking advantage of each known data point in the support set and allowing these few samples to guide the direction of the Monte-Carlo sampling, features are augmented directly into the support set to reduce the estimation bias. For each known data point (xi, yi), the corresponding vector in the feature space is denoted fi. A proposal distribution Q(f′|f)=N(fi, Σ) is used to sample f′i, and p(y|f) is a deterministic variable given by the predicted logits from the classifier for a feature f. Sampled points are screened by the acceptance criterion p(yi|f′i)>p(yi|fi). If accepted, f′i becomes the new starting feature point, and the next sampling step is run using the proposal distribution N(f′i, σ2). If rejected, the sampling process for (xi, yi) terminates. The sampling process is illustrated in
A meta-language listing of an exemplary embodiment of an algorithm for each epoch update of the selected sampling method is shown in
The proposal distribution ensures that samples are drawn from the vicinity of the known point during the process. N(fi, Σ) is a multivariate Gaussian distribution centered on fi. The covariance matrix Σ is an identity matrix scaled by a hyper-parameter σ, which allows each location on the feature vector to be sampled independently. However, the proposal distribution alone is only a random-walk process, which imposes no further constraint on the sampled points. With a feature f=fθ(x), the acceptance criterion is whether the sampled feature has a higher predicted probability of belonging to the ground-truth class, i.e., whether p(yi|f′i)>p(yi|fi):
where:
This criterion indicates that a sampled point is accepted when it is either closer to its class prototype or further away from the other classes in the high-dimensional feature space. Either way, the accepted point is ensured to provide helpful information, avoiding the disadvantages of random-walk sampling. This selected sampling in the feature space allows exploration of unknown regions of the feature space while still controlling the quality of the samples used for optimization.
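The chain of proposals and the acceptance criterion described above can be sketched as follows. This is a sketch under stated assumptions: the prototype-logit softmax used for p(y|f) and the `max_steps` cap are illustrative choices not fixed by the text, and the function names are hypothetical.

```python
import numpy as np

def prob(f, protos, y):
    # p(y|f): softmax over prototype logits (dot products; an assumed head)
    logits = protos @ f
    e = np.exp(logits - logits.max())
    return e[y] / e.sum()

def selected_sampling(f_i, y_i, protos, sigma=0.1, max_steps=10, seed=0):
    # Chain of Gaussian proposals N(cur, sigma^2 I) started at f_i. A sample
    # is accepted only if it raises p(y_i|f); if accepted it becomes the new
    # chain center, and the chain terminates at the first rejection.
    rng = np.random.default_rng(seed)
    accepted, cur = [], f_i
    for _ in range(max_steps):
        cand = rng.normal(cur, sigma)
        if prob(cand, protos, y_i) > prob(cur, protos, y_i):
            accepted.append(cand)
            cur = cand
        else:
            break
    return accepted
```

Accepted samples can then be added to the support set of class yi before recomputing the class prototype, reducing the bias δ in the empirical mean.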
As shown in
Disclosed herein is a novel, effective method to power fine-tuning in few-shot learning. By reducing biases in the novel-class feature distribution, the effect of fine-tuning is boosted across datasets from different domains. Without any meta-training process, fast feature adaptation can also be achieved through a better understanding of the biases in feature distributions for few-shot learning.
As would be realized by one of skill in the art, the disclosed method described herein can be implemented by a system comprising a processor and memory, storing software that, when executed by the processor, performs the functions comprising the method.
As would further be realized by one of skill in the art, many variations on the implementations discussed herein which fall within the scope of the invention are possible. Moreover, it is to be understood that the features of the various embodiments described herein are not mutually exclusive and can exist in various combinations and permutations, even if such combinations or permutations are not expressly set forth herein, without departing from the spirit and scope of the invention. Accordingly, the method and apparatus disclosed herein are not to be taken as limitations on the invention but as an illustration thereof. The scope of the invention is defined by the claims which follow.
This application claims the benefit of U.S. Provisional Patent Application No. 63/148,392, filed Feb. 11, 2021, the contents of which are incorporated herein in their entirety.
Filing Document: PCT/US2022/015093; Filing Date: Feb. 3, 2022; Country: WO
Provisional Application Number: 63/148,392; Date: Feb. 2021; Country: US