The present application generally relates to performing a computer vision task.
Developing a reliable vision system is a fundamental challenge for robotic technologies (e.g., indoor service robots and outdoor autonomous robots): such a system should ensure reliable navigation even in challenging environments, such as adverse weather conditions (e.g., fog, rain), poor lighting conditions (e.g., over/under exposure) or sensor degradation (e.g., blurring, noise), and should guarantee high performance in safety-critical functions. Current solutions proposed to improve model robustness usually rely on generic data augmentation techniques or employ costly test-time adaptation methods. In addition, most approaches focus on addressing a single vision task (typically, image recognition) utilising synthetic data.
It is an object of the present invention to improve on the prior art.
According to an embodiment, a method for controlling an electronic apparatus for performing a computer vision task may include receiving a corrupted image from a camera, identifying a corruption type of the corrupted image using a corruption identification module, obtaining normalisation parameters associated with the identified corruption type from a codebook, updating a computer vision model, trained to perform the task, by replacing normalisation parameters of the computer vision model with the obtained normalisation parameters, and performing the computer vision task using the updated computer vision model.
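For illustration only, the identify-lookup-swap flow described above can be sketched as follows. All names (the codebook structure, the `"norm"` key, the corruption labels) are hypothetical, chosen for the sketch rather than taken from the disclosure:

```python
import numpy as np

def update_model(model_params, codebook, corruption_type):
    """Return a copy of the model's parameters with its normalisation
    parameters replaced by the codebook entry for the identified corruption.
    (Illustrative sketch; the parameter layout is an assumption.)"""
    updated = dict(model_params)                 # shallow copy: weights are shared
    updated["norm"] = codebook[corruption_type]  # swap normalisation params only
    return updated

# Hypothetical codebook: corruption type -> normalisation parameters
codebook = {
    "fog":   {"mean": np.array([0.2]), "var": np.array([0.5])},
    "noise": {"mean": np.array([0.0]), "var": np.array([1.5])},
}
model = {"weights": np.ones(4),
         "norm": {"mean": np.array([0.0]), "var": np.array([1.0])}}

updated = update_model(model, codebook, "fog")
```

Only the normalisation entry changes; the task weights are untouched, which is what makes the codebook cheap to store and the swap cheap to perform.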
The corruption identification module may include a fast Fourier transform, FFT, model. The identifying of the corruption type of the corrupted image using the corruption identification module may include extracting features from the corrupted image, retaining only features occurring with a frequency above a frequency threshold using the FFT model, determining a probability that the input image is affected by a corruption type associated with the retained features, based on a distribution of distances between the retained features and pretrained prototypical features of a known set of corruption types, and identifying the corruption type by selecting the corruption type with the highest probability.
The distribution of distances may be a distribution of Euclidean, L2, distances.
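The FFT-based identification steps above can be sketched as follows. The windowing choice and the softmin conversion of L2 distances into probabilities are assumptions made for the sketch:

```python
import numpy as np

def fft_feature(image, n=15, size=64):
    """Illustrative FFT feature: amplitude spectrum of a (size x size) image,
    keeping an n x n high-frequency window (the exact windowing used in the
    disclosure is not reproduced here)."""
    amp = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    return amp[:n, :n].ravel()  # simplified high-frequency window, flattened

def identify_corruption(feature, prototypes):
    """Pick the corruption whose prototype is nearest in L2 distance,
    turning distances into probabilities with a softmin."""
    names = list(prototypes)
    d = np.array([np.linalg.norm(feature - prototypes[k]) for k in names])
    p = np.exp(-d) / np.exp(-d).sum()          # softmin over L2 distances
    return names[int(np.argmax(p))], p

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))
feat = fft_feature(img)
# Two hypothetical prototypes: one near the feature, one far away
prototypes = {"noise": feat + 0.1, "blur": feat + 50.0}
best, probs = identify_corruption(feat, prototypes)
```

The nearest prototype wins, so `best` resolves to the corruption whose prototype lies closest in feature space.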
The corruption identification module may include a machine learning model trained to estimate a corruption type using a corrupted image. The identifying the corruption type of the corrupted image using a corruption identification module may include inputting the corrupted image to the machine learning model to estimate the corruption type from the corrupted image.
The machine learning model may be a deep neural network model.
The computer vision model may be a neural network model. The normalisation parameters may include at least one of batch normalisation, BatchNorm, parameters or layer normalisation, LayerNorm, parameters.
The neural network model may be a convolutional neural network model.
The computer vision task may be selected from a list of computer vision tasks including object detection, object recognition, and semantic segmentation.
The method may further include generating the codebook by: providing a pre-trained computer vision model and a training data set, wherein the training data set comprises, for each corruption type of a plurality of corruption types, a plurality of corrupted images and corresponding labels associated with the computer vision task the model has been trained to perform; re-training the pre-trained computer vision model, for each corruption type, using the plurality of corrupted images and corresponding labels by updating only normalisation layers of the pre-trained computer vision model; extracting the normalisation layers of the re-trained computer vision model for each corruption type; and generating the codebook to associate each recognizable corruption type with the corresponding normalisation layer parameters.
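A minimal sketch of codebook generation, under a strong simplification: instead of re-training normalisation layers by gradient descent as the method describes, the stand-in below merely re-estimates per-channel statistics from each corruption's training data. The data shapes and labels are hypothetical:

```python
import numpy as np

def fit_norm_params(features):
    """Stand-in for re-training only the normalisation layers: here we simply
    re-estimate per-channel mean/variance on the corrupted training data."""
    return {"mean": features.mean(axis=0), "var": features.var(axis=0)}

def build_codebook(training_sets):
    """training_sets: corruption type -> (num_images, channels) feature array.
    Returns a codebook: corruption type -> normalisation parameters."""
    return {k: fit_norm_params(v) for k, v in training_sets.items()}

rng = np.random.default_rng(1)
training_sets = {
    "fog":   rng.normal(0.5, 0.1, size=(100, 8)),   # hypothetical fog features
    "noise": rng.normal(0.0, 2.0, size=(100, 8)),   # hypothetical noise features
}
codebook = build_codebook(training_sets)
```

Each corruption type ends up with its own parameter set, reflecting how differently corrupted inputs shift the feature statistics.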
The normalisation layers may include at least one of batch normalisation, BatchNorm, layers and layer normalisation, LayerNorm, layers.
The method may include generating the codebook by: providing a pre-trained computer vision model, corrupted images, and corresponding corruption type labels estimated by the corruption identification module; updating normalisation layers of the pre-trained computer vision model based on the corrupted images and the corresponding corruption type labels using a test-time adaptation algorithm; extracting the updated normalisation layers for each estimated corruption type; and generating a codebook to associate each recognizable corruption type with the corresponding normalisation layer parameters.
According to an embodiment, an electronic apparatus for performing a computer vision task may include a memory and at least one processor connected to the memory, wherein the at least one processor is configured to receive a corrupted image from a camera, identify a corruption type of the corrupted image using a corruption identification module, obtain normalisation parameters associated with the identified corruption type from a codebook, update a computer vision model, trained to perform the task, by replacing normalisation parameters of the computer vision model with the obtained normalisation parameters, and perform the computer vision task using the updated computer vision model.
The corruption identification module may include a fast Fourier transform, FFT, model. The at least one processor may extract features from the corrupted image, retain only features occurring with a frequency above a frequency threshold using the FFT model, determine a probability that the input image is affected by a corruption type associated with the retained features, based on a distribution of distances between the retained features and pretrained prototypical features of a known set of corruption types, and identify the corruption type by selecting the corruption type with the highest probability.
The distribution of distances may be a distribution of Euclidean, L2, distances.
The corruption identification module may include a machine learning model trained to estimate a corruption type using a corrupted image. The at least one processor may input the corrupted image to the machine learning model to estimate the corruption type from the corrupted image.
Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
With reference to
The RVC 10 includes at least one processor and storage having instructions stored thereon that when executed by the at least one processor cause the at least one processor to perform methods such as capturing (or receiving or obtaining) a corrupted image using the camera, estimating (or identifying) a corruption type of the corrupted image using a corruption identification module, obtaining normalisation parameters associated with the estimated corruption type from a codebook, updating a computer vision model, trained to perform the task, by replacing normalisation parameters of the computer vision model with the obtained normalisation parameters, performing the task using the updated computer vision model, determining a trajectory based on a result of the computer vision task, and controlling the driving system to move the robot according to the trajectory.
The storage may be non-transitory computer-readable media. While the instructions are being installed on the storage, the media carrying them may be transitory, e.g. a download signal.
In the room 12 of
With reference to
As will be appreciated from the following description, the computer vision system 20 performs the steps of receiving a corrupted image from the camera 14 (
There are two main embodiments involving the corruption identification module. In one embodiment, the corruption identification module includes a fast Fourier transform, FFT, model. In another embodiment, the corruption identification module 22 includes a machine learning model trained to estimate the corruption type using a corrupted image.
There are also two main embodiments involving updating the computer vision model 26. In one embodiment, a codebook has been generated by using minimal training to train only normalisation layers of the computer vision model for each corruption type. In another embodiment, a codebook has been generated by using test time adaptation where no labelled training data is available. Test time adaptation updates the normalisation statistics for the codebook. In either embodiment, at inference time, the codebook is used to identify the normalisation statistics associated with the identified corruption type.
With reference to
FROST performs a two-step approach. At training time, FROST extracts high-frequency amplitudes from corrupted images, aggregates them for images with the same corruption, and builds a set of per-corruption feature prototypes. It then estimates corruption-specific (Corr-S), or corruption-specialized 34, and corruption-generic (Corr-G) normalization layer parameters (normalisation statistics) 40 starting from a pretrained model 32. When the computer vision model is a neural network, and may specifically be a convolutional neural network, the normalisation layer parameters, or normalisation parameters, may include at least one of batch normalisation, BatchNorm, statistics and layer normalisation, LayerNorm, statistics. At test time, FROST identifies the corruption type k̂ present in the test images and uses a codebook to map such corruptions to normalization layer parameters that minimize the recognition error. These normalization parameters S′k̂ come from either the corruption-generic or a corruption-specific model, depending on the confidence of the model.
Background. Given a computer vision model (or model) F, 32, that approximates ground truth labels y ∈ 𝒴 of samples x ∈ 𝒳 using a training set 𝒟 = {xi, yi}. In the current embodiment, a corrupted image is defined by x̃ = x + ψ, where x ∈ ℝ^(w×h×3) is a clean RGB image with width w, height h, and ψ is the corruption. The goal is to improve the object recognition accuracy of F on the corrupted images.
Corruptions. Previous works showed that a real corruption ψ can be approximated by a combination of synthetic corruptions. A subset 𝒦 = {Contrast, Brightness, Defocus Blur, Glass Blur, Motion Blur, Zoom Blur, Impulse Noise, Shot Noise, Gaussian Noise} of the 9 most common real corruptions is used. A synthetic corruption for the clean image x is defined by Ck,s(x) such that x + Ck,s(x) ≈ x̃ for k ∈ 𝒦 (e.g., k = Contrast). The parameter s ∈ {1, 2, 3, 4, 5} is an integer which defines the corruption intensity depending on the degradation level, with s = 1 being the lowest and s = 5 the highest.
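A severity-parameterised corruption operator Ck,s can be illustrated with one member of the listed set. The noise-level schedule below (0.05·s) is an assumption for the sketch, not the actual operator used by the method:

```python
import numpy as np

def gaussian_noise_corruption(x, s, rng):
    """Illustrative corruption operator C_{k,s} for k = Gaussian Noise:
    additive noise whose standard deviation grows with severity s in {1..5}.
    The 0.05*s schedule is a hypothetical choice."""
    sigma = 0.05 * s
    return rng.normal(0.0, sigma, size=x.shape)

rng = np.random.default_rng(0)
x = np.full((32, 32, 3), 0.5)                       # clean grey image in [0, 1]
x_tilde = x + gaussian_noise_corruption(x, s=5, rng=rng)   # x + C_{k,5}(x)
```

Higher severities produce a larger statistical shift, which is precisely what the per-corruption normalisation statistics later compensate for.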
Training. FFT feature extraction. For each synthetic corruption in 𝒦, we construct a set 𝒟k,5 by applying the corruption Ck,5(x) to all the images x ∈ 𝒟. In this case, we use only the strongest corruption (s = 5) to obtain a better separation between features for different corruptions. For each image x̃k = x + Ck,s(x) with corruption k, we extract an FFT feature Φk = ℑn(x̃k) by performing the FFT ℑ(⋅) on the input image, applying a windowing operation to retain the first n high-frequency components of the amplitude spectrum, and flattening. In particular, n is selected empirically as n = 15, computing ℑ(⋅) on images resized to 64×64. Then, we average each set of features specific to corruption k to obtain a corruption prototype, 36, with N being the size of the training set. This can be done via a running average during training with no need to store all features in memory. Inspecting this set of prototypes for different corruptions, we note that some are very well clustered (e.g., Contrast, Brightness and Defocus Blur), while others (e.g., Blur types and Noise distortions) are hardly separable.
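The running average mentioned above, which builds a prototype without storing every feature in memory, can be sketched in a few lines (pure Python, names illustrative):

```python
class RunningPrototype:
    """Incremental mean of features for one corruption type, so the
    prototype can be built in a single pass with O(dim) memory."""
    def __init__(self, dim):
        self.n = 0
        self.value = [0.0] * dim

    def update(self, feature):
        self.n += 1
        # incremental mean: m_n = m_{n-1} + (x - m_{n-1}) / n
        self.value = [m + (f - m) / self.n for m, f in zip(self.value, feature)]

proto = RunningPrototype(dim=3)
for feat in ([1.0, 2.0, 3.0], [3.0, 4.0, 5.0]):
    proto.update(feat)
```

After the two updates the prototype equals the element-wise mean of the two features.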
With reference to the clustering of FFT features: this set is originally labeled with corruption-relative labels L from 𝒦. Let us define a new labeling L* obtained through k-means. Setting the number of clusters for the k-means empirically to 5, we obtain a similar clustering score as for the original labels. In particular, if we group together the Blur corruptions and the Noise corruptions, we obtain a new labeling L′. Comparing L* with L′ via quantitative analysis, we get an adjusted Rand score of 89.1% (meaning that the clusters are very similar). For this reason, we aggregate prototypes for features belonging to similar corruptions, obtaining a new set of corruptions 𝒦′ = {Contrast, Brightness, Defocus Blur, Blur, Noise} with 5 macro corruptions. We also obtain a new set of macro prototypes.
Note that S denotes the full set of normalisation layer parameters, S = (γ, β, E[x], Var[x]), while Λ denotes only the normalisation statistics, Λ = (E[x], Var[x]).
Estimation of corruption-specific statistics. We denote by S the set of statistics estimated at all normalization layers (Batch/Layer Normalization) in the recognition model F. These layers are storage-friendly, as they have only two learnable parameters (scale γ and shift β), which have been shown to adapt differently to input images affected by different corruptions. Therefore, our purpose is to use them to improve recognition accuracy for corrupted images. First, we train a model F (updating only the normalization layers with S parameters) on 𝒟, performing data augmentation on clean samples by adding Ck,s(⋅). Image augmentations are selected according to a uniform distribution using the original corruption functions for augmentation (K = 9 in total) with severe corruptions only, i.e., s ∈ {4, 5}. With this training, we obtain corruption-generic normalization statistics. Then, we train F (updating only the normalization layers with S parameters) on 𝒟k,{4,5} (i.e., 𝒟 corrupted with corruption k, only using s ∈ {4, 5}), producing K different corruption-specific sets of normalization statistics Sk. According to the macro corruption grouping 𝒦′, we average normalization statistics for indistinguishable corruptions, obtaining S′k sets, one for each macro corruption.
Inference. At test time, we use prototypical features to select the best set S*.
Prototype matching. We perform inference on each test image x̃u with unknown corruption u. First, we extract the feature Φu = ℑn(x̃u), retaining the first n high-frequency components of the FFT amplitude spectrum. Then, we compute the probability that the image is corrupted with corruption k, p(u = k), based on the distance between Φu and each macro prototype in 𝒦′, using the L2 distance. Note that a test image can also be non-corrupted; we explain how this case is handled in the next paragraph.
This may be expressed as follows: the corruption identification module includes a fast Fourier transform, FFT, model, and estimating the corruption type of the corrupted image using the corruption identification module comprises: extracting (or obtaining) features from the corrupted image; retaining (or maintaining) only features occurring with a frequency above a frequency threshold using the FFT model; determining a probability that the input image is affected by a corruption type associated with the retained features, based on a distribution of distances between the retained features and precomputed prototypical features of a known set of corruption types; and estimating the corruption type by selecting the corruption type with the highest probability. The distribution of distances is a distribution of Euclidean, L2, distances.
Selection of statistics. We use the probability scores to select the most suitable set of normalization statistics S* via our codebook, and apply it on top of the model F to enhance object recognition capabilities. First, we determine whether the corruption is uncertain by applying a thresholding operation on the two most likely corruptions. We define k̂1 and k̂2 as the most likely and second most likely estimated corruptions. If |p(u = k̂1) − p(u = k̂2)| < T, then we use the corruption-generic normalization statistics. Otherwise, we use the corruption-specific normalization statistics S′k̂. Corruption estimates have intrinsic noise; each corruption has its own set of normalization layer parameters in the Corr-S model, and the aggregation into FROST macro corruptions provides a good approximation of it which is more convenient for corruption identification via FFT (see
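The thresholded choice between generic and corruption-specific statistics can be sketched as follows; the statistics dictionaries and the threshold value are hypothetical:

```python
def select_statistics(probs, specific_stats, generic_stats, T=0.1):
    """Choose corruption-specific statistics only when the gap between the
    top-1 and top-2 probabilities exceeds threshold T; otherwise fall back
    to the corruption-generic set (a sketch of the selection rule)."""
    order = sorted(probs, key=probs.get, reverse=True)
    k1, k2 = order[0], order[1]
    if abs(probs[k1] - probs[k2]) < T:
        return "generic", generic_stats
    return k1, specific_stats[k1]

specific = {"fog": {"mean": 0.2}, "noise": {"mean": 0.0}}   # hypothetical S'_k
generic = {"mean": 0.1}                                      # hypothetical Corr-G set

sure = select_statistics({"fog": 0.8, "noise": 0.2}, specific, generic)
unsure = select_statistics({"fog": 0.52, "noise": 0.48}, specific, generic)
```

A confident identification (gap 0.6) selects the fog-specific set; a near-tie (gap 0.04) falls back to the generic set, avoiding a wrong specialization.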
With reference to
The corruption type is associated with a probability of being true. This probability is compared to a threshold probability as part of an uncertainty estimation 50. If the probability is lower than the threshold, the pre-trained computer vision model 26 may be used to perform the computer vision task, e.g. object recognition. If the probability is greater than or equal to the threshold, the updated computer vision model 52 is used for the computer vision task.
With reference to the codebook, which maps the corruption identified by the CIM 22 to the respective corruption-specific BN parameters: such parameters are initialized with those of the pretrained downstream task model F(⋅) (computer vision model 32) trained on clean source images, and adapted to test images via TTA, separately for each identified corruption κ̂, obtaining a corruption-specific set Λκ̂. Finally, Λκ̂ is plugged into F(⋅) to generate a robust model 58 achieving enhanced robustness on downstream tasks, specifically on the identified corruption.
This approach builds upon the observation that the statistics of BN layers in any convolutional architecture differ significantly for images corrupted according to different corruption types, but are similar for images with the same corruption type. Some previous work explored adapting the statistics of normalization layers for TTA, keeping a single set of normalization parameters for all corruptions to build generic normalization layers that accommodate any input corruption. Instead, we build multiple sets of normalization statistics, one estimated for each corruption type. Per-corruption Adaptive Normalization (PAN) is composed of three parts:
Image corruption: Let F(x, y; Θ) be a DNN model mounted on a robot for visual scene understanding. The aim of F(⋅) is to approximate the ground truth labels y ∈ 𝒴 of input images x ∈ 𝒳 ⊂ ℝ^(w×h×3), optimizing its set of learnable parameters Θ (e.g., weights and biases of the network architecture of the model). Among these parameters, we denote the set of parameters of its BN layers by Λ ⊂ Θ. Samples of a source (clean) dataset 𝒟S = {xi, yi} are drawn from a probability distribution pS(x) on a source domain. Then, we consider a target (corrupted) dataset 𝒟T = {x̃i} of distorted images sampled from a target domain. We make a distinction between real (endogenous) and synthetic (exogenous) distortions as follows:
Endogenous distortions are natural corruptions that imply a shift in image statistics due to either inherent noise of camera sensors, deformations of objects observed in the images, or divergence of patterns of the objects. This is the most general case, where the target test data cannot be parametrized by any operator. We denote by 𝒟κ a corrupted set of images presenting the same type of corruption (e.g., dark images, where κ denotes the corruption type). The distribution of the images in the corrupted set is different from that of the source images, that is, pκ(x) ≠ pS(x).
Exogenous distortions are synthetic approximations of real corruptions obtained as a function of clean images. They are obtained assuming that there exists an operator Ck,s which corrupts a given set of clean images by Ck,s(𝒟S) = 𝒟k,s. Synthetic corruptions represent an approximation of real corruptions, i.e., 𝒟k,s ≈ 𝒟κ, where k ∈ 𝒦 denotes the corruption type and s denotes the severity level of the corruption. Images of each corrupted set 𝒟k,s are sampled from pk,s(x) = ψk,s(pS(x)), as the operator Ck,s transforms the distribution by a non-linear transformation ψk,s according to the corresponding corruption type k and severity s.
The CIM 22 is designed using a convolutional encoder followed by a linear classifier.
Architecture of the CIM. Extraction of corruption-specific features is accomplished through a DNN model r=l∘g, composed of a convolutional encoder g(⋅) that projects an input image x to a feature vector by z=g(x), and a linear layer l(z) that outputs corruption identification probabilities.
Training: The CIM 22 performs a corruption classification task to recognize and approximate the corruption present in each input image. A CIM model is trained on a training set formed as the union of K sets ∪κ=1…K 𝒟κ, where each 𝒟κ = {x̃i, ℓi} is a dataset of images corrupted with some corruption κ (which can be endogenous or exogenous, with k = κ in the exogenous case), labelled with the corruption label ℓ = κ denoting its distortion. 𝒦 is the set of possible corruption labels, which has cardinality |𝒦| = K. The CIM model is trained end-to-end via a distance-based contrastive training method utilizing a Class Anchor Clustering (CAC) loss defined by:
Inference: After training the CIM model r(⋅) on the K corrupted training sets, the final layer l(⋅) is removed and the feature extractor g(⋅) is used to extract q-dimensional features z ∈ ℝ^q from corrupted samples x̃. Then, prototypical features are computed from the training set, where each zi is a feature vector corresponding to an image corrupted with corruption κ, and hκ is the number of samples affected by the corruption κ. The calculated K prototypes are concatenated into a K×q matrix, where (⋅)T denotes the vector/matrix transpose.
We employ a distance-based classifier ϕ(⋅,⋅) to classify features according to their relative distance to the prototypical features. The classifier ϕ(z, ⋅) computes the distances d between z and the K prototypes. The output is normalized by b = d ⊙ (1 − softmin(d)), where ⊙ is the element-wise product.
In this way, the CIM 22 includes a machine learning model (e.g. a deep neural network model) trained to estimate (or identify) a corruption type using a corrupted image. Estimating the corruption type of the corrupted image using the CIM 22 includes inputting the corrupted image to the machine learning model to estimate the corruption type from the corrupted image.
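The distance-based classification with the b = d ⊙ (1 − softmin(d)) rescaling described above can be sketched as follows. This illustrates only the inference-time scoring, not the CAC training loss; prototype values are hypothetical:

```python
import numpy as np

def softmin(d):
    """Softmin over a distance vector: smaller distances get larger weight."""
    e = np.exp(-d)
    return e / e.sum()

def cac_style_scores(z, prototypes):
    """Distance-based classification in the CAC style: L2 distances to the
    K prototypes, rescaled by b = d * (1 - softmin(d)) so the smallest
    distance is suppressed the most; the prediction is the argmin of b."""
    d = np.linalg.norm(prototypes - z, axis=1)   # L2 distance to each prototype
    b = d * (1.0 - softmin(d))
    return b, int(np.argmin(b))

prototypes = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])  # K=3, q=2
z = np.array([0.5, 0.2])                                         # test feature
b, pred = cac_style_scores(z, prototypes)
```

The feature lies near prototype 0, so its rescaled score is driven far below the others and the classifier selects corruption 0.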
Batch Normalization (BN) is a technique used to make training of artificial neural networks faster and more stable through normalization of the layer inputs by re-centering and re-scaling. It is widely used in DNNs to mitigate the problem of internal covariate shift, where changes in the distribution of the inputs of each layer affect the learning of the network. BN is applied over a 4D input (a mini-batch of 2D inputs with an additional channel dimension).
Let ℬ denote a mini-batch of features obtained using model F(⋅), and let f ∈ ℬ ⊂ ℝ^(B×D×L) be a feature map in the mini-batch. The mean μ ∈ ℝ^D and standard deviation σ ∈ ℝ^D (the BN statistics) are employed per dimension over the mini-batch, channel-wise, for normalizing features using
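A standard form of this per-channel normalisation, f̂ = γ(f − μ)/√(σ² + ε) + β, is assumed in the sketch below (the printed equation is not reproduced in the text above):

```python
import numpy as np

def batch_norm(f, gamma, beta, eps=1e-5):
    """Normalise a (B, D, L) mini-batch per channel D with its own batch
    statistics, then scale and shift: the standard BN transform."""
    mu = f.mean(axis=(0, 2), keepdims=True)    # per-channel mean over B and L
    var = f.var(axis=(0, 2), keepdims=True)    # per-channel variance
    f_hat = (f - mu) / np.sqrt(var + eps)
    return gamma * f_hat + beta

rng = np.random.default_rng(0)
f = rng.normal(3.0, 2.0, size=(8, 4, 16))      # B=8, D=4, L=16
out = batch_norm(f, gamma=np.ones((1, 4, 1)), beta=np.zeros((1, 4, 1)))
```

With unit scale and zero shift, each channel of the output is re-centered to zero mean and rescaled to (approximately) unit variance.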
Test-Time Adaptation (TTA) refers to adapting DNNs to distribution shifts, with access to only the unlabelled test samples belonging to the target domain at test time. The conventional way of employing BN in test time is to set μ and σ2 as those estimated from source data. Instead, TTA methods estimate BN statistics directly from test batches to reduce the distribution shift at test time by
This practice is simple yet effective and is thus adopted in many recent TTA studies. Here, we propose updating BN statistics via TTA, separately for each corruption type, as described next.
Estimating statistics on test data. Let Λ(μ, σ²) ⊂ Θ be the set of BN statistics of the model F(x, y; Θ). We denote the set of BN statistics obtained after training the model on the source dataset by ΛS. We first initialize K sets of BN parameters from the source set ΛS. Then, we update each set according to the corruption type present in the input image. In the ideal case, each set is associated with a specific corruption type κ, and each corruption type is always identified correctly. Therefore, the BN statistics Λκ associated with the type κ are updated only with images corrupted with corruption type κ that belong to the test set 𝒟κ. We define this ideal reference set of statistics by Λκref. However, the target corrupted test images come without the corruption label κ, and the BN parameters must be computed on the corruption type estimated by the CIM (κ̂).
When deployed on a robotic device, our system is composed of (i) a CIM module employed to recognise the corruption type affecting the unlabelled input test image, and (ii) K sets of clean BN statistics, obtained by training a model F(⋅) on clean training data. The purpose of our PAN is to improve the downstream task performance of F(⋅) by using the CIM to identify the correct corruption type, updating the correct set of BN parameters via TTA, and finally plugging the updated set of BN parameters into the network.
Codebook mapping. In detail, at inference time, for each input test image x ∈ 𝒟T, we estimate the corruption type using the CIM by r′(x) = κ̂. Then, we use a codebook to map each estimated corruption type κ̂ to a corruption-specific set of BN statistics Λκ̂. Note that the BN statistics associated with each of the K corruptions are initialized as ΛS, and will be assigned to Λκ̂ after they are estimated by TTA. The more the CIM is able to correctly recognize the corruption (when κ̂ = κ), the more the BN statistics become specialized for that corruption and differ from the others.
With reference to
The corruption type is associated with a probability of being true. This probability is compared to a threshold probability as part of an uncertainty estimation 50. If the probability of the most probable corruption type is lower than the threshold, the corruption type is labelled as unseen. The unseen corruption type is added to a corruption type storage 60. The corruption type with the highest probability is taken to be the corruption type and TTA may be used to update the normalisation parameters in the storage 48. The computer vision model is then updated to a specialised computer vision model 52. The specialised computer vision model includes the normalisation parameters from the storage 48 that are associated with the corruption type.
With reference to
In this way, the method shown in
With reference to
The normalisation statistics may include rolling mean and variance. The rolling mean, E[xt] may be updated using formula 7:
The variance, Var[xt] may be updated using formula 8:
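One plausible form of the rolling updates referenced as formulas 7 and 8 (the printed formulas are not reproduced above) is the exponential moving average commonly used for BN running statistics; the momentum value is an assumption:

```python
import numpy as np

def update_rolling_stats(mean, var, batch, momentum=0.1):
    """Sketch of a rolling update: exponential moving averages of the batch
    mean and variance. This is an assumed form, consistent with standard
    BN running-statistics updates, not the disclosure's exact formulas."""
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    new_mean = (1.0 - momentum) * mean + momentum * batch_mean
    new_var = (1.0 - momentum) * var + momentum * batch_var
    return new_mean, new_var

mean, var = np.zeros(4), np.ones(4)            # initial (source) statistics
rng = np.random.default_rng(0)
for _ in range(200):                           # stream of unlabelled test batches
    batch = rng.normal(2.0, 3.0, size=(32, 4))
    mean, var = update_rolling_stats(mean, var, batch)
```

After enough test batches, the rolling statistics converge towards the test distribution's mean (2) and variance (9), which is the behaviour TTA relies on.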
The feature prototypes may be updated in a similar manner, e.g. via a running average.
In this way, the method described in
With reference to
If the image is corrupted, it is assessed 106 whether the corruption type is known. If no, the corruption type is stored in storage 60. If yes, the corruption type is used to select 108 corresponding normalisation parameters. Next, it is determined 110 whether TTA is needed. If so, the normalisation statistics are updated as described above. If no, the object detection is performed 104.
With reference to
With reference to
With reference to
With reference to
With reference to
Other applications of the foregoing methods are also envisaged.
For example, with reference to
With reference to
With reference to
The smart fridge 800 also includes a sensor such as a camera, e.g. an RGB camera. The processor may execute instructions stored on the storage to perform one or more computer-implemented methods, including a method of recommending a shopping list, listing available items, or suggesting recipes that can be cooked with the available items. When items are added to the fridge, the method comprises the camera capturing an image of the items. The image may be a corrupted image. The method performs a computer vision task using the captured image. The computer vision task may be object recognition, e.g. recognising food, or items, entering and being removed from the fridge. The method then determines the shopping list, the list of available items, or the recipes that can be cooked with the available items.
Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and where appropriate other modes of performing present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2313525.4 | Sep 2023 | GB | national |
| 2405097.3 | Apr 2024 | GB | national |
This application is a continuation application, claiming priority under § 365(c), of an International application No. PCT/IB2024/057240, filed on Jul. 26, 2024, which is based on and claims the benefit of a United Kingdom patent application number 2313525.4, filed on Sep. 5, 2023, in the United Kingdom Intellectual Property Office, and of a United Kingdom patent application number 2405097.3, filed on Apr. 10, 2024, in the United Kingdom Intellectual Property Office, the disclosure of each of which is incorporated by reference herein in its entirety.
| Number | Date | Country | |
|---|---|---|---|
| Parent | PCT/IB2024/057240 | Jul 2024 | WO |
| Child | 18933406 | US |