The present invention relates to an image recognition method, which is particularly advantageous in the context where only very few labelled training images are available. More specifically, but not by way of limitation, the proposed method provides efficient image anomaly detection, and it allows tedious image annotations to be bypassed. The present invention also relates to a corresponding imaging apparatus configured to carry out the proposed method.
Currently, supervised deep learning approaches are ubiquitous, and they achieve state-of-the-art performance for various tasks, including anomaly detection in image data. However, obtaining labels for image data is expensive. For example, large amounts of healthcare data, e.g. medical images, become available every day in various healthcare organisations, such as large hospitals, and constitute precious yet unexploited resources. Nevertheless, their annotations are missing, and medical images, as well as images in other application fields, require precise and time-consuming analysis by domain experts, such as radiologists. This hinders the applicability of supervised machine learning models in real-world scenarios, as such models require large amounts of annotated training data.
This issue is further exacerbated for large-scale data sets, as they usually suffer from the problem of data imbalance. Training a machine learning system on an imbalanced dataset can introduce unique challenges to the learning problem. Imbalanced data normally refers to a classification problem where the number of observations per image class is not equally distributed. More specifically, a large amount of data/observations often exists for one class (referred to as the majority class), and much fewer observations exist for one or more other classes (referred to as the minority classes). For real large-scale data, the amounts of data in different categories often do not follow an ideal uniform distribution; these data sets usually exhibit long-tailed label distributions if the classes are sorted along the x-axis according to the number of samples from high to low, with the y-axis representing the number of occurrences per class. In the case of anomaly detection, anomalies are usually rare in the collected data, and deep neural networks have been found to perform poorly on rare classes of anomalies. This has a particularly pernicious effect on the deployed model if more emphasis is placed on minority classes at inference time. Therefore, training models in a fully unsupervised or self-supervised fashion would be advantageous, allowing a significant reduction of the time spent on the annotation task.
Another critical issue of most existing anomaly detection methods is that they can only be applied to data from a single image domain. Pre-trained deep anomaly detection networks suffer significant performance degradation when exposed to a new image dataset from an unfamiliar distribution. Using available ad hoc domain adaptation techniques only provides suboptimal solutions, and these techniques also need label-rich source domain data in order to transfer knowledge from the source domain to unseen target domain data.
It is an object of the present invention to overcome at least some of the problems identified above related to image processing methods and their related systems. More specifically, one of the objects of the present invention is to propose a solution for detecting anomalous or out-of-distribution images.
According to a first aspect of the invention, there is provided a method in a machine learning system for detecting anomalous images as recited in claim 1.
The proposed image anomaly detection method has the advantage that it builds upon self-supervised learning, such that the system can be trained with only a small amount of annotated data, and it avoids potential bias, thereby making it practical in real-world scenarios. The present invention also has the advantage that it allows tedious annotations to be bypassed, or at least the number of image annotations to be significantly reduced. Perhaps even more importantly, the proposed method is capable of working with both single-domain and multi-domain image data. In other words, the proposed method can be understood as a new cross-modality image anomaly detection method. This means that the proposed method is particularly advantageous for improving anomaly detection in the presence of domain shift, and the module or system implementing the proposed method can easily be plugged into existing image recognition systems to improve their generalisation ability. Furthermore, the proposed method also solves the class imbalance problem.
According to a second aspect of the invention, there is provided a non-transitory computer program product comprising instructions for implementing the steps of the method according to the first aspect of the present invention when loaded and run on computing means of a computation apparatus.
According to a third aspect of the invention, there is provided a machine learning system configured to carry out the method according to the first aspect of the present invention.
Other aspects of the invention are recited in the dependent claims attached hereto.
Other features and advantages of the invention will become apparent from the following description of a non-limiting example embodiment, with reference to the appended drawings, in which:
An embodiment of the present invention will now be described in detail with reference to the attached figures. The embodiment is described in the context of a deep artificial neural network which is configured to detect anomalous or out-of-distribution images, but the teachings of the invention are not limited to this environment. For instance, the teachings of the present invention could be used in other types of artificial intelligence or machine learning systems. The teachings of the present invention may be applied in various technical fields including for instance medical applications (medical images), defect detection in industrial production (e.g. in watch industry), waste detection and analysis, remote sensing (aerial) imaging, event detection in sensor networks, etc. Identical or corresponding functional and structural elements which appear in the different drawings are assigned the same reference numerals. It is to be noted that the use of words “first”, “second” and “third”, etc. may not imply any kind of particular order or hierarchy unless this is explicitly or implicitly made clear in the context.
The process starts in step 101, where it is determined whether or not the incoming image data stream contains a single imaging modality. In the present description, an imaging or image modality is understood to mean an imaging or image domain or, more broadly, an image type. For example, different image modalities can be distinguished by any property of the target object(s) in the respective image (such as the object category, colour, etc.), by the imaging protocols, scanners, or software used to capture or process the images, etc. If at least two image modalities are detected, in other words, if it is detected that the incoming training images 9 are collected from at least two different domains (i.e. the case of multimodal data), then in step 103 the training images 9 are grouped based on their image modalities into source domain images and target domain images. Here, the source domain refers to the domain of image data where the majority of the images are unlabelled and a small fraction of the images are labelled. The aim is to transfer anomaly detection from the source domain to a new image data set (the target or test domain) from an unfamiliar distribution, where the target domain images are not labelled. In other words, a small portion of the source domain images are labelled (e.g. 1% to 10% of the images in that domain), while the target domain images are unlabelled according to the present example. Test images are all from the target domain.
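Purely by way of illustration, the grouping of step 103 may be sketched in Python as follows; the helper name detect_modality is hypothetical and stands in for whatever metadata-based or content-based modality detection step 101 uses:

    from collections import defaultdict

    def group_by_modality(images, detect_modality):
        # Group incoming training images by their detected modality (step 103).
        # detect_modality is a hypothetical callable returning a modality label,
        # e.g. derived from imaging protocol or scanner metadata.
        groups = defaultdict(list)
        for img in images:
            groups[detect_modality(img)].append(img)
        return groups

    # With multimodal data, one modality serves as the partially labelled
    # source domain and the remaining one(s) as the unlabelled target domain(s).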
In step 105, a deep generative model is trained, and the source domain images, or at least some of them, are transformed by this model into the appearance of the target domain. During this transformation, the source domain image's content, such as the shape, the objects' category, and the object structure, is preserved, while other image properties, such as the style information, optionally including texture and/or colour, are translated from the target domain to the source domain image. More specifically, an image domain conversion or mapping is applied to convert the source domain images to match the target domain images in terms of style information, such as texture and/or colour. In this manner, image transformations, i.e. converted or transformed source domain images, are obtained. The cross-modality image conversion model or mapping function to implement this step, and which in the system shown in
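As a non-limiting sketch of how step 105 might be applied at inference time, the toy PyTorch generator below (the name G_s2t and its architecture are assumptions, not prescribed by the method) translates a batch of source domain images; a real system could use, for example, a CycleGAN-style network trained on unpaired source/target images:

    import torch
    import torch.nn as nn

    # Hypothetical minimal generator; content preservation would be enforced
    # during training (e.g. via cycle-consistency), which is not shown here.
    G_s2t = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
    )

    def convert_to_target(source_batch):
        # Map source domain images to the target domain appearance.
        with torch.no_grad():
            return G_s2t(source_batch)

    converted = convert_to_target(torch.randn(8, 3, 224, 224))  # 8 RGB images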
The proposed method uses a two-step training process: it first learns image representations, also referred to as image features, of unlabelled data using a pre-training process, in the present description also referred to as a pretext task, at a pre-training stage, and it then adapts those representations to the actual task of semi-supervised anomaly detection. The pre-training stage aims to leverage unlabelled data in a task-agnostic way using a defined pretext objective. Let ϕC(·, WC) : ℝ^D → ℝ^dC denote the pre-training network with weights WC. It comprises the first encoder ψC, which maps an input image in ℝ^D (also referred to as an input feature space or pre-training network input feature space) to a compressed representation in ℝ^d (also referred to as a first feature space), followed by the first projection head network 5 that further compresses the input into ℝ^dC. The pretext task relies on a stochastic image transformation operation T : ℝ^D → ℝ^D that heavily modifies the respective input images 9, i.e. the operation augments the number of the images that can be used in the training process, as the transformation generates new images. Thus, referring back to the flow chart, in step 107, in the case of multimodal data, transformations, which in this example are stochastic transformations, are applied to both the source domain images and the converted source domain images obtained in step 105 to obtain transformed or modified images 13. In this example, the images are randomly modified, but this does not have to be the case. The transformation operations may include one or more of the following operations: applying random colour jittering to the respective images, cropping randomly resized patches from the respective images, and applying Gaussian blur to the respective images.
On the other hand, if in step 101 it was determined that the incoming images 9 in the incoming data stream are all from a single modality, then the process continues in step 109, where the above-explained image transformations are applied to the source domain images (which are thus all from the same modality) to obtain transformed or modified images 13.
Next, in step 111, training of the pre-training network is carried out, as explained next in more detail. The goal of the pre-training (i.e. the pretext task) is to optimise the weights WC of the pre-training network ϕC such that two versions of an image modified by T are brought together in the representation space ℝ^dC. Given a pair of modified images (xi, xj) generated from the same input image, the network is trained to identify xj from a set of N images {xk}k≠i. In this example, this is done by maximising a similarity measure, which in this example is the cosine similarity, between the representations of the pair (ϕC(xi) = zi and ϕC(xj) = zj) and minimising the cosine similarity with respect to the other samples' representations in the set of N images. The pretext task's objective LC can thus be formulated as

LC = −log( exp(sim(zi, zj)/τ) / Σk=1..N 1k≠i exp(sim(zi, zk)/τ) ),   (1)

where zi = ϕC(xi; WC) and zj = ϕC(xj; WC), sim(·,·) denotes the cosine similarity, and 1k≠i ∈ {0, 1} is an indicator function evaluating to 1 if k ≠ i. N denotes the number of samples within a set of images or image set, i.e. in a minibatch, and τ is a first hyperparameter called temperature. This loss is also known as the InfoNCE loss, as taught by Aaron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation Learning with Contrastive Predictive Coding", arXiv:1807.03748 [cs, stat], January 2019. It is to be noted that each minibatch is in this example a user-specified number of training images. So instead of using all training images to compute gradients (as in full-batch training), minibatch training uses a user-specified number of training images at each iteration of the optimisation.
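A minimal PyTorch sketch of the loss of Equation 1 for a minibatch may look as follows, assuming z1 and z2 hold the projections of the two modified versions of each image (the 2N-sample convention described further below):

    import torch
    import torch.nn.functional as F

    def info_nce_loss(z1, z2, tau=0.5):
        # z1, z2: (B, dC) projections of the two modified views of each image.
        B = z1.size(0)
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2B unit vectors
        sim = z @ z.t() / tau                                # cosine similarities
        # Mask self-similarities so they never enter the denominator (1k≠i).
        mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(mask, float('-inf'))
        # The positive of sample i is its other modified version.
        pos = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
        return F.cross_entropy(sim, pos)  # averages over all 2B positive pairs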
The optimisation process implemented during the pre-training phase may use an algorithm called stochastic gradient descent (SGD). The algorithm's aim is to find a set of internal model parameters that perform well against some performance metrics, such as the above-defined loss term of Equation 1. The algorithm's iterative nature means that the search process occurs over multiple discrete steps; each step ideally slightly improves the model parameters. Each step involves using the model with the current set of internal parameters to make predictions on a randomly sampled minibatch (few images) without replacement, comparing the predictions to the expected outcomes, calculating the error, and using the error to update the internal model parameters.
The first encoder ψC is configured to process a plurality of image pairs from each minibatch at the same time. The objective is to learn a unique representation of each image, so that the modified images from a given image pair (a positive pair) are similar to each other while at the same time being different from other images and their modified versions (negative pairs). To be more specific, the pre-training network randomly samples a minibatch of N images and defines the InfoNCE loss on pairs of modified images derived from the minibatch, resulting in 2N data points. Given a positive pair of modified images from the same image, the other 2(N−1) modified images within the minibatch are considered as negative pairs. The final loss is computed across all positive pairs, both (i, j) and (j, i) in a minibatch, e.g. (x1, x4) and (x4, x1). The pre-training network thus advantageously forms all possible image pair combinations from the modified source domain images and optionally from the modified converted source domain images (in the case of multimodal data).
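Combining the above, the pre-training loop of step 111 with minibatch SGD could read as follows as a sketch; the toy encoder, the additive-noise augment function and the random data are placeholders for the first encoder ψC, the projection head 5, the stochastic transformation T and the training images 9:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    images = TensorDataset(torch.randn(256, 3, 32, 32))       # placeholder images 9
    loader = DataLoader(images, batch_size=32, shuffle=True, drop_last=True)

    phi_C = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128),  # encoder psi_C
                          nn.ReLU(), nn.Linear(128, 64))              # projection head
    opt = torch.optim.SGD(phi_C.parameters(), lr=0.1, momentum=0.9)

    def augment(x):                    # stands in for the stochastic T
        return x + 0.1 * torch.randn_like(x)

    for epoch in range(10):
        for (x,) in loader:            # minibatch sampled without replacement
            # info_nce_loss as defined in the previous sketch
            loss = info_nce_loss(phi_C(augment(x)), phi_C(augment(x)))
            opt.zero_grad(); loss.backward(); opt.step()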
After the pre-training phase, the weights of the first encoder ψC are used to initialise the second encoder ψMAD for anomaly detection in step 113. In this example, NS hypersphere centres (more specifically, their locations) are initialised by running the K-means algorithm on the embedded normal samples, i.e. the features of the normal images (as opposed to anomalous image samples). Hyperspheres are understood to be multidimensional spheres in the multidimensional representation or feature space, and they define image clusters, as becomes clear later. The K-means algorithm is a clustering algorithm, i.e. a method of vector quantisation that aims to partition a given number of observations into K clusters in which each observation belongs to the cluster with the nearest mean (cluster centre), serving as a prototype of the cluster. This results in a division of the data space into Voronoi cells. The system 1 does not have prior knowledge about the number of clusters at the beginning, so the number of clusters (hyperspheres) is set to be quite large for initialisation. Then, the non-meaningful clusters, i.e. those image clusters whose cardinality is not large enough or which include noisy samples, are removed progressively during the optimisation procedure in step 115. For example, the process may start with 10 clusters (hyperspheres) and end up with five clusters at the end of the optimisation. In other words, the system 1 automatically obtains the optimal number of image clusters after optimisation.
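The centre initialisation of step 113 may be sketched with scikit-learn's KMeans, deliberately over-clustering at the start; the value 10 is the assumed example from the text, and the function name init_centres is hypothetical:

    import torch
    from sklearn.cluster import KMeans

    def init_centres(encoder, normal_images, n_init_clusters=10):
        # Initialise hypersphere centres on the embedded normal samples.
        with torch.no_grad():
            feats = encoder(normal_images).cpu().numpy()
        km = KMeans(n_clusters=n_init_clusters, n_init=10).fit(feats)
        # Non-meaningful clusters are pruned later, during step 115.
        return torch.tensor(km.cluster_centers_, dtype=torch.float32)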
In step 115, training of the anomaly detection network, i.e. fine-tuning of the machine learning system 1, is carried out. Formally, we have access to n unlabelled samples x1, . . . , xn ∈ X with X ⊆ ℝ^D, where D is the input dimension. In addition to the unlabelled samples, we have access to a few m labelled samples (x̃1, ỹ1), . . . , (x̃m, ỹm) ∈ X × Y, where Y = {−1, +1}. Known normal samples are labelled as ỹ = +1 and known abnormal samples are labelled as ỹ = −1. Let ϕMAD(·, WMAD) : ℝ^D → ℝ^d denote the anomaly detection network with weights WMAD, which maps an input image in ℝ^D to a representation in the feature space ℝ^d, and let c1, . . . , cNS ∈ ℝ^d denote the hypersphere centres in ℝ^d. The anomaly detection objective LMAD over the representations in ℝ^d can then be formulated as

LMAD = (1/(n+m)) Σi=1..n ∥ϕMAD(xi; WMAD) − ck∥² + (η/(n+m)) Σj=1..m (∥ϕMAD(x̃j; WMAD) − ck∥²)^ỹj + (λ/2) ∥WMAD∥²,   (2)

where the centre ck is assigned as the closest hypersphere centre to the image under assessment, i.e. k = argminj ∥ϕMAD(xi; WMAD) − cj∥. The first term in Equation 2 penalises unlabelled points lying away from the closest centre, since we assume that the majority of unlabelled samples come from the normal distribution. The second term in Equation 2 pushes known abnormal samples away from the closest centre and pulls known normal samples towards that centre. Finally, the third term in Equation 2 imposes a regularisation on the network's weights WMAD with a second hyperparameter λ. A third hyperparameter η controls the relevance of the labelled terms in Equation 2. We opt for two-phase training instead of a joint training setting due to the difference in the data processed by each phase; the two-phase training also requires fewer hyperparameters (scales of loss terms, etc.). The images used in step 115 are the training images 9, more specifically the source domain images and, optionally, in the case of multimodal data, also the converted source domain images. The training images 9 used in this step may be the same as the ones used in step 111, or they may be different, or at least partially different, training images.
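A condensed PyTorch sketch of the objective of Equation 2 may look as follows; as assumptions, the weight-decay term (λ/2)∥WMAD∥² is left to the optimiser's weight_decay option, and each sum is normalised by its own term count rather than by n+m:

    import torch

    def mad_loss(phi_mad, centres, x_unlab, x_lab, y_lab, eta=1.0, eps=1e-6):
        # Semi-supervised anomaly detection objective (cf. Equation 2).
        # First term: pull unlabelled samples to their closest hypersphere centre.
        d_u = torch.cdist(phi_mad(x_unlab), centres).min(dim=1).values
        loss_unlab = (d_u ** 2).mean()
        # Second term: y = +1 pulls known normal samples in; y = -1 pushes known
        # anomalies away via the inverse squared distance.
        d_l = torch.cdist(phi_mad(x_lab), centres).min(dim=1).values
        loss_lab = ((d_l ** 2 + eps) ** y_lab.float()).mean()
        return loss_unlab + eta * loss_lab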
The weights WMAD of the second encoder ψMAD are in this example updated for the anomaly detection task (Equation 2) by using stochastic gradient descent (SGD) with backpropagation until convergence. At each step, a cluster centre is kept only if the cardinality of its normal samples is greater than a fraction γ of the maximum cardinality. This ensures that the model learns the best number of centres without any a priori knowledge of the number of modes. Upon testing, in step 117, an anomaly score of a sample, i.e. a test image 11, is given, for example, by computing the distance between its embedding or features and the closest hypersphere centre: SMAD(x) = ∥ϕMAD(x; WMAD) − ck∥, where k = argminj ∥ϕMAD(x; WMAD) − cj∥ identifies the closest hypersphere centre. However, other ways to compute the anomaly score exist. The distance metric may be based on the Euclidean distance between the features of the test image and the closest hypersphere centre, which has the same dimension as the features of the test image. However, the distance metric can also be defined more broadly. For example, one can consider the most similar clusters (hyperspheres) with respect to the test image, e.g. by using the top-k matches.
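The scoring of step 117 and the cardinality-based pruning may be sketched as follows, continuing the conventions of the previous sketches (prune_centres and its assignments argument, i.e. each training sample's closest-centre index, are hypothetical helper names):

    def anomaly_score(phi_mad, centres, x):
        # S_MAD(x): distance to the closest hypersphere centre.
        return torch.cdist(phi_mad(x), centres).min(dim=1).values

    def prune_centres(centres, assignments, gamma=0.1):
        # Keep a centre only if its normal-sample cardinality is at least
        # a fraction gamma of the largest cluster's cardinality.
        counts = torch.bincount(assignments, minlength=centres.size(0))
        return centres[counts >= gamma * counts.max()]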
The above-described example method uses a pairwise setup to compute the similarity measure, but this formulation can be extended to consider triplets of training data samples or even a higher number of training data samples. More specifically, instead of generating two modified images 13 from any given training image 9, three modified images may, for example, be generated (in the case of triplets) and then fed into the first encoder ψC. The above teachings may be applied to these three images to enforce their feature representations to converge to or towards the same point in the feature space at the output of the pre-training network, with feature dimension dC.
The above-described method may be carried out by suitable circuits or circuitry. The terms “circuits” and “circuitry” refer to physical electronic components or modules (e.g. hardware), and any software and/or firmware (“code”) that may configure the hardware, be executed by the hardware, and/or be otherwise associated with the hardware. The circuits may thus be configured or be operable to carry out or they comprise means for carrying out the required method as described above.
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive, the invention being not limited to the disclosed embodiments. Other embodiments and variants are understood, and can be achieved by those skilled in the art when carrying out the claimed invention, based on a study of the drawings, the disclosure and the appended claims. Further variants may be obtained by combining the teachings of any of the designs explained above.
In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used. Any reference signs in the claims should not be construed as limiting the scope of the invention.