This application claims the benefit under 35 U.S.C. § 119(a)-(d) of United Kingdom Patent Application No. 2303369.9, filed on Mar. 8, 2023 and titled “IMAGE CLASSIFICATION METHOD, AND COMPUTER-READABLE STORAGE MEDIUM”. The above cited patent application is incorporated herein by reference in its entirety.
The present application is directed to the application of an attention mechanism in computer vision. In particular, the present application relates to an image classification method for a computer device. For example, the method provides an attention mechanism applied over a quadtree representation of an input image. The method may for example be used in the field of video surveillance, where there is a need to classify images.
Methods for diverting attention to the most important regions of an image and disregarding irrelevant parts are called attention mechanisms. In a vision system, an attention mechanism can be treated as a dynamic selection process that is realised by adaptively weighting features according to the importance of the input.
In computer vision, attention is either used in conjunction with convolutional neural networks (CNNs) or used to replace certain components of such networks while keeping their overall structure in place. However, this dependency on CNNs is not mandatory, and a pure transformer applied directly to sequences of image patches can perform exceptionally well on image classification tasks. One such example is the Vision Transformer (ViT) model. A transformer in machine learning is a deep learning model that uses the mechanism of attention, differentially weighting the significance of each part of the input data. Transformers in machine learning are composed of multiple self-attention layers.
The ViT model represents an input image as a series of image patches, and directly predicts class labels for the image. A CNN operates on pixel arrays, whereas ViT splits the image into visual tokens. The vision transformer divides an image into fixed-size patches, linearly embeds each of them, and includes positional embedding as an input to the transformer encoder.
The self-attention layer in ViT makes it possible to embed information globally across the overall image. The model also learns on training data to encode the relative location of the image patches to reconstruct the structure of the image.
Vision transformers have extensive applications in popular image recognition tasks such as object detection, segmentation, image classification, and action recognition.
In recent years, Transformer models have revolutionized machine learning in general. While this has produced impressive results in the field of Natural Language Processing, Computer Vision quickly stumbled upon computation and memory problems due to the high resolution and dimensionality of the input data. This is particularly true for video, where the number of tokens increases cubically relative to the frame and temporal resolutions. A first approach to solving this was the Vision Transformer, which introduces a partitioning of the input into embedded grid cells, lowering the effective resolution. More recently, Swin Transformers introduced a hierarchical scheme that brought the concepts of pooling and locality to transformers in exchange for much lower computational and memory costs.
However, Swin Transformers only consider long range dependencies in an indirect fashion. It is only thanks to its window shifting mechanism that, through the use of multiple Swin Transformer blocks, information can slowly diffuse to distant regions of the image.
This work proposes a reformulation that views Swin Transformers as regular Transformers applied over a quadtree representation of the input, a reformulation that intrinsically provides a wider range of design choices for the attentional mechanism.
Compared to similar approaches such as Swin and MaxViT, our method works on the full range of scales while using a single attentional mechanism, allowing us to simultaneously take into account both dense short range and sparse long-range dependencies with low computational overhead and without introducing additional sequential operations, thus making full use of GPU parallelism.
According to an aspect of the present disclosure, there is provided a method of obtaining an attention matrix for use in a transformer-based model, the method comprising:
where |(Di)j|=pk for j∈[1, . . . , m] and |F| corresponds to the feature length of the feature vector of each spatial location,
The method according to the present disclosure may further comprise:
In the method according to the present disclosure, the attentional mechanism may be applied over all possible sets of k consecutive spatial axes, resulting in m-k+1 attention matrices.
The method according to the present disclosure may further comprise:
In the method according to the present disclosure, the transformer-based model may be for either image classification or regression given a two-dimensional input representing an image.
In the method according to the present disclosure, the transformer-based model may be for either video classification or regression given a three-dimensional input representing a video.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing a program for causing a computer to execute a method of obtaining an attention matrix for use in a transformer-based model, the method comprising:
where |(Di)j|=pk for j∈[1, . . . , m] and |F| corresponds to the feature length of the feature vector of each spatial location,
In the non-transitory computer readable storage medium, the method may further comprise:
In the method in the non-transitory computer readable storage medium according to the present disclosure, the attentional mechanism may be applied over all possible sets of k consecutive spatial axes, resulting in m-k+1 attention matrices.
In the non-transitory computer readable storage medium according to the present disclosure, the method may further comprise:
In the non-transitory computer readable storage medium according to the present disclosure, the transformer-based model may be for either image classification or regression given a two-dimensional input representing an image.
In the non-transitory computer readable storage medium according to the present disclosure, the transformer-based model may be for either video classification or regression given a three-dimensional input representing a video.
According to another aspect of the present disclosure, there is provided an apparatus for obtaining an attention matrix for use in a transformer-based model, the apparatus having at least one processor configured to:
where |(Di)j|=pk for j∈[1, . . . , m] and |F| corresponds to the feature length of the feature vector of each spatial location,
In the apparatus according to the present disclosure, the at least one processor may be configured to:
In the apparatus according to the present disclosure, the at least one processor may be configured to apply the attentional mechanism over all possible sets of k consecutive spatial axes, resulting in m-k+1 attention matrices.
In the apparatus according to the present disclosure, the at least one processor may be configured to:
In the apparatus according to the present disclosure, the transformer-based model may be for either image classification or regression given a two-dimensional input representing an image.
In the apparatus according to the present disclosure, the transformer-based model may be for either video classification or regression given a three-dimensional input representing a video.
Additional features of the present disclosure will become apparent from the following description of embodiments with reference to the attached drawings.
Swin Transformers are a type of neural topology based on transformer networks. Contrary to conventional transformers and their Computer Vision counterpart, Vision Transformers (ViT), Swin employs a hierarchical partitioning of the image in order to define local regions upon which the attention mechanism is applied. The image resolution is successively halved after a specific number of transformer blocks, which increases the reach of the attention mechanism without actually increasing the size of the individual attention matrices. The main problem with such an approach is that long range dependencies are taken into consideration in an indirect fashion. It is only thanks to its window shifting mechanism that, through the use of multiple Swin Transformer blocks, information can slowly diffuse to distant regions of the image.
In this work we propose a reformulation of the method, exposing it as a sequential computation over a quadtree representation of the input images. We then propose a new flavor of Swin transformers that naturally emerges from said reformulation, which we refer to as Swin on Axes (Swinax). Compared to the former, Swinax can jointly consider token relationships at multiple scales, the dilation factor of the attentional mechanism increasing along with said scale. It can thus achieve the same degree of sparsity and computation of long range dependencies that Swin achieves over multiple layers, but it does so in a single transformer block, with the computations being performed in parallel.
Transformers were first introduced in 2017 as an attentional mechanism that allows for the passing of information between a collection of tokens X={x(1); . . . ; x(k)} in language models. The attentional mechanism in question extracts a series of keys KX=[fK(x(1)); . . . ; fK(x(k))], queries QX=[fQ(x(1)); . . . ; fQ(x(k))] and values VX=[fV(x(1)); . . . ; fV(x(k))], and then applies the attentional mechanism shown below for a given attention head j:
Here, dK is the dimensionality of keys KX(j), with Smx(⋅) corresponding to the softmax activation function. A given attentional layer has m attention heads, with the final output of the layer being a linear combination of the latter:
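As a concrete illustration of the attention computation for one head and of the linear combination of the m heads described above, a minimal NumPy sketch might look as follows. The projection matrices W_Q, W_K, W_V and the output projection W_O stand in for the functions fQ, fK, fV and the combination step; they are illustrative assumptions, not the original implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for one head over k tokens X of shape (k, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_K = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_K))                # (k, k) attention matrix
    return A @ V

def multi_head_attention(X, heads, W_O):
    """Concatenate the m head outputs and linearly combine them with W_O."""
    outs = [attention_head(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outs, axis=-1) @ W_O

# toy usage: 6 tokens of dimension 8, two heads with d_K = d_V = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
Y = multi_head_attention(X, heads, rng.normal(size=(2 * 4, 8)))   # (6, 8)
```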
Apart from the attentional mechanism, a transformer block typically contains a two-layer fully connected network updating the token representations, with layer normalization after both the multi-head attention and the fully connected sub-network. The original work proposed an encoder-decoder architecture, where a series of transformer blocks are stacked to produce an encoder processing an input token sequence, and a similar set of such layers produces a decoder. Contrary to the encoder, the decoder contains two attention mechanisms: the first one is the regular self-attention block, where the attentional mechanism looks at other tokens within the same sequence. The second one updates the token representations of the decoder based on those of the encoder in what is commonly known as cross-attention.
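A sketch of one such block is given below, under the post-normalization ordering of the original formulation. The callable attn stands in for the multi-head attention sketched above, and the learnable normalization parameters are omitted; this is a sketch under those assumptions, not a prescribed implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(X, attn, W1, b1, W2, b2):
    """Post-norm encoder block: multi-head self-attention followed by a
    two-layer fully connected network, each with a residual connection
    and layer normalization."""
    X = layer_norm(X + attn(X))                  # attention sub-layer
    hidden = np.maximum(0.0, X @ W1 + b1)        # ReLU feed-forward layer
    return layer_norm(X + hidden @ W2 + b2)      # updated token representations
```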
Later variations introduced encoder-only and decoder-only models. The former either model the relationship between elements in a sequence by reconstructing masked input tokens, or directly predict a target label based on the input sequence. The latter, on the other hand, are trained in an auto-regressive fashion by masking the self-attention mechanism, ensuring that each token has access only to itself and the previous tokens.
The first significant success in applying transformer models to computer vision came with Vision Transformers (ViT). Here, the authors proposed an encoder-only topology where an input image is broken down into non-overlapping cells, typically using a 16×16 grid. Each of the cells is linearly projected into a feature vector representing a token, with an additional classification token added to the sequence. This token sequence is then fed to the encoder-only transformer, with the final output of the classification token being used as the overall model output.
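The tokenization step can be sketched as follows. The grid size, embedding dimension, projection matrix and classification token are illustrative assumptions (in practice the projection and the classification token are learned parameters); random values are used here only to make the sketch executable.

```python
import numpy as np

def vit_tokenize(image, grid=16, d_model=768, rng=np.random.default_rng(0)):
    """Split an image into a grid of non-overlapping cells, flatten each cell,
    linearly project it to a token, and prepend a classification token."""
    H, W, C = image.shape
    ph, pw = H // grid, W // grid
    # (grid, ph, grid, pw, C) -> (grid*grid, ph*pw*C)
    cells = image.reshape(grid, ph, grid, pw, C).transpose(0, 2, 1, 3, 4)
    cells = cells.reshape(grid * grid, ph * pw * C)
    W_proj = rng.normal(size=(ph * pw * C, d_model))     # linear patch embedding
    cls_token = rng.normal(size=(1, d_model))            # classification token
    return np.concatenate([cls_token, cells @ W_proj], axis=0)  # (1 + grid**2, d_model)

tokens = vit_tokenize(np.zeros((224, 224, 3)))  # -> (257, 768)
```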
While this type of model provides a robust tokenization approach, it still suffers from high memory usage on the order of O(n^4), where n=16 is the cell grid resolution, and from the correspondingly high computational cost required to generate the attention matrix. While this still corresponds to a relatively small attention matrix when compared to some language models, a large batch size is usually required in order to stabilize the training of computer vision models. A common way to solve this problem is to apply restrictions on the considered tokens when computing the attentional mechanism.
Local attention enforces locality during the computation of the attentional mechanism, with Swin transformers being the most well known example. In Swin, the grid of tokens is partitioned into non-overlapping W×W windows, where typically W=7. The attentional mechanism is applied to each window, with the windows being displaced by 50% of their width at alternating layers. This limits the size of the attention matrices while still allowing the tokens falling near the window edges to diffuse information throughout the image, and reduces the memory usage from O(n^4) to O(n^2·W^2). Swin also introduces a linear down-sampling function where, after a set of d transformer blocks, neighboring tokens in a non-overlapping grid of 2×2 cells are linearly combined together. This further reduces the memory and computational costs by lowering the number of tokens fed to subsequent transformer blocks, while also indirectly increasing the receptive field of the attention mechanism.
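A sketch of the window partitioning and of the alternating window shift is given below. W=7 follows the description above; the attention computation itself is assumed to be the mechanism sketched earlier, applied independently inside each window.

```python
import numpy as np

def window_partition(tokens, W=7):
    """Partition an (n, n, F) token grid into non-overlapping W x W windows.

    Returns an ((n//W)**2, W*W, F) batch: attention is then computed
    independently inside each window, giving O(n^2 * W^2) memory instead
    of the O(n^4) of full self-attention over the grid."""
    n, _, F = tokens.shape
    x = tokens.reshape(n // W, W, n // W, W, F).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, W * W, F)

def shift_windows(tokens, W=7):
    """Cyclically shift the grid by half a window, as done at alternating
    Swin layers, so that information can cross window boundaries."""
    return np.roll(tokens, shift=(-(W // 2), -(W // 2)), axis=(0, 1))

grid = np.zeros((56, 56, 96))                    # e.g. a 56x56 grid of 96-d tokens
windows = window_partition(grid)                 # (64, 49, 96)
shifted = window_partition(shift_windows(grid))  # windows over the shifted grid
```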
Neighborhood Attention (NA) considers a local token neighborhood for each of the target tokens during attention computation, resulting in a model equivalent to the masking of all elements outside the first k diagonals of the attention matrix. In order to implement it efficiently, the authors provide low level kernels to perform the attention computation, bypassing the otherwise prohibitive memory and computational costs that would result from implementing it as a series of linear algebra operators. The advantage of this approach is its intrinsic ability to diffuse information across the whole image, due to its lack of hard region boundaries between image segments. Similarly to Swin, a linear down-sampling step is introduced after every few attentional blocks.
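The connectivity pattern NA corresponds to can be written as a dense mask for a one-dimensional token sequence, as sketched below. The cited implementations use dedicated kernels rather than an explicit mask, which is precisely what makes them efficient; this sketch only illustrates the pattern.

```python
import numpy as np

def neighborhood_mask(n, k):
    """Boolean (n, n) mask keeping the main diagonal and the k-1 diagonals
    on each side, i.e. each token attends only to its nearest neighbours."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) < k

mask = neighborhood_mask(8, k=3)
# entries outside the band would be set to -inf before the softmax
```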
A more recent work proposes Dilated Neighborhood Attention (DiNA), introducing dilation to NA. Similar to the approach it is based on, this is achieved through a low level CUDA kernel implementation. The dilation factor then becomes a per layer model hyper-parameter.
Sparse attention aims to combine sparse global attention patterns with denser local ones by introducing strategies that reduce the memory and computational costs. Sparse Transformers was the first work to introduce such an approach. The authors noted that ViT first learned locality patterns, factorizing across the vertical and horizontal axes in later layers. Based on that, they implemented hand crafted factorizations of the attentional mechanism, weaving these insights into the architecture.
Longformers use a combination of sliding window attention, optionally with dilation, and global attention. They do so by considering a select subset of tokens as globally connected, serving as shortcuts for the diffusion of information across the token sequence. Routing Transformers use a learned attention connectivity instead. To do so, a series of k centroids are learned and used during k-means clustering to group the queries and keys of the attention mechanism. This restricts the attention matrix to mappings between queries and keys belonging to the same cluster. Note that nothing prevents a given token from having its query and key belong to different clusters, potentially giving the attention matrix a single component graph connectivity.
MaxViT, an approach based on Swin transformers, proposes decomposing an input X∈R^(H×W×C) into a batched form
In this form, applying an attention mechanism over the N×N axes corresponds to the local windowed attention commonly seen in Swin. Applying it over the (H/N)×(W/N) axes instead corresponds to a dilated attention over the whole picture, where the dilation factor is N. By applying two successive attention blocks, one over the local windows and the other dilated over the whole input, the approach considers both dense local and sparse global attention.
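A sketch of the two axis groupings is given below. The helper names and the (H/N, N, W/N, N) factorization are our own, chosen to match the description; the actual MaxViT implementation may differ in its details.

```python
import numpy as np

def block_axes(X, N):
    """Group the N x N window axes together: attention over the last
    spatial pair is the local windowed attention used in Swin."""
    H, W, C = X.shape
    x = X.reshape(H // N, N, W // N, N, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, N * N, C)                 # (H/N * W/N, N*N, C)

def grid_axes(X, N):
    """Group the (H/N) x (W/N) axes together instead: attention over them
    is a dilated attention over the whole picture with dilation factor N."""
    H, W, C = X.shape
    x = X.reshape(H // N, N, W // N, N, C).transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, (H // N) * (W // N), C)   # (N*N, H/N * W/N, C)

X = np.zeros((56, 56, 96))
local = block_axes(X, N=7)    # (64, 49, 96)
dilated = grid_axes(X, N=7)   # (49, 64, 96)
```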
Our work falls into the latter category, proposing an attentional mechanism, also referred to as an attention mechanism, where the attention pattern gradually sparsifies the further away a region is from the target token. It shares some similarities with MaxViT, but uses a more general, principled approach that lends it a higher degree of flexibility.
The proposed approach is a reformulation of Swin transformers where the segmentation of the input into local attentional regions and the linear pooling of tokens are seen as regular transformer and dimensionality reduction operations applied to a quadtree representation of the input image. As we show in this section, this allows for an equivalent but more straightforward application of the approach, as well as for other attentional operations such as attention dilation (Sec. 3.3) and multi-scale attention.
Swin defines a window of constant size W×W which is applied over the input token grid in a tiled, non-overlapping fashion, as shown in
After successively applying a series of such layers, the resolution of the grid is reduced to half its original size by linearly projecting the features in each 2×2 non-overlapping neighborhood into a single output feature vector. This reduces the overall number of cells by 75%. Further transformer blocks are applied after that. While these maintain the same window size and thus the number of cells considered and size of the attention matrix, the receptive field effectively doubles in size due to the down-scaling of the grid.
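A sketch of this 2×2 linear down-sampling is given below. The projection matrix W_merge stands in for the learned linear layer, and the concatenate-then-project form is an assumption consistent with the description above.

```python
import numpy as np

def patch_merge(tokens, W_merge):
    """Linearly combine each non-overlapping 2x2 neighbourhood of tokens
    into a single token, halving the grid resolution (75% fewer cells)."""
    n, _, F = tokens.shape
    x = tokens.reshape(n // 2, 2, n // 2, 2, F).transpose(0, 2, 1, 3, 4)
    x = x.reshape(n // 2, n // 2, 4 * F)      # concatenate the 2x2 neighbourhood
    return x @ W_merge                        # (n/2, n/2, F_out)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(56, 56, 96))
merged = patch_merge(tokens, rng.normal(size=(4 * 96, 192)))  # (28, 28, 192)
```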
An intuitive way of representing the data input to a Swin transformer block is to define X∈RB×N
Here we propose a different approach where the input data is directly structured into a quadtree, a hierarchical structure that is perfectly suited for this task. To do so, we may first expand the height and width dimensions of the input images into a series of size 2 axes, then reshuffle and reshape:
For 2d images, this results in a series of axes {a1, . . . , an} of size 4. Please note that the above transform can just as easily be applied to three-dimensional data, such as video or volumes, where each axis ai would be of size 8. Under this representation, axis a1 partitions an input 2d image into four equally sized chunks corresponding to the four quadrants of a quadtree, while each successive axis partitions the previous quadrants into four more segments. This is illustrated by the leftmost image in
Here, the last three spatial dimensions, also referred to as spatial axes, (a4, a5, a6) correspond to the attention window, while the last dimension (F) holds the features. The other dimensions correspond to the batch size. After applying k transformer blocks on the data, Swin performs a linear transform on non-overlapping neighbourhoods of size 2×2, resulting in a reduction of the image size. In our representation, this corresponds to shifting the selected axes for the attention window to the left, such that the attention window corresponds to dimensions (a3, a4, a5), while the last two dimensions (a6, F) can then be linearly transformed and replaced by a single output feature dimension.
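A minimal sketch of the expansion and reshuffle described above is given below, assuming a square input whose side is a power of two. The interleaving of one height bit and one width bit per level is one possible realisation of the reshuffle, and the helper name is our own.

```python
import numpy as np

def to_quadtree(X):
    """Reshape an (H, W, F) image, with H = W = 2**d, into a quadtree layout
    (a1, ..., ad, F) where every axis ai has size 4 and splits the quadrants
    of the previous level into four sub-quadrants."""
    H, W, F = X.shape
    d = int(np.log2(H))
    # split height and width into d axes of size 2 each
    x = X.reshape((2,) * d + (2,) * d + (F,))
    # interleave height bit i with width bit i, coarsest level first
    order = [a for i in range(d) for a in (i, d + i)] + [2 * d]
    x = x.transpose(order)
    # merge each (2, 2) pair into one size-4 quadtree axis
    return x.reshape((4,) * d + (F,))

quad = to_quadtree(np.zeros((64, 64, 96)))   # -> shape (4, 4, 4, 4, 4, 4, 96)
```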
A direct extension to Swin comes from considering what would happen were we to shift the selection of attention window axes without subsuming the rightmost axis into the feature representation. That is, if we consider the following assignment of axes:
where B and axes (a1, a2, a6) jointly define the batch size, while axes (a3, a4, a5) define the attention window. Applying a transformer block over the above representation is equivalent to doubling the size of the attention window while maintaining the number of tokens, which is achieved through an implicit 2× dilation of the attention regions. This is illustrated by the rightmost image in
With this approach, we can compute larger attention regions without having to sub-sample the image nor increase the computational and memory costs. This allows us to consider both local and global attention before sub-sampling the grid cells, increasing the flexibility of Swin. The shifting of the attention windows, as performed on Swin at alternate attention blocks, also becomes unnecessary. This is due to subsequent dilated layers already allowing communication across attentional window boundaries by increasing the receptive field.
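Under this layout, choosing which axes form the attention window reduces to a transpose and a reshape, as sketched below. The layout is the one produced by the to_quadtree helper above, and the function name select_window is our own.

```python
import numpy as np

def select_window(quad, window_axes):
    """Move the chosen quadtree axes into a single 'token' dimension and
    flatten the remaining axes into one batch dimension.  Choosing the last
    axes gives local windowed attention; shifting the selection left keeps
    the same window size but implicitly dilates the attention regions,
    without sub-sampling the grid."""
    d = quad.ndim - 1                          # number of quadtree axes
    batch_axes = [a for a in range(d) if a not in window_axes]
    x = quad.transpose(batch_axes + list(window_axes) + [d])
    F = quad.shape[-1]
    return x.reshape(-1, 4 ** len(window_axes), F)

quad = np.zeros((4, 4, 4, 4, 4, 4, 96))
local = select_window(quad, [3, 4, 5])     # dense local windows, 64 tokens each
dilated = select_window(quad, [2, 3, 4])   # same window size, 2x dilated regions
```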
We can further exploit the quadtree representation by applying the attention mechanism to multiple axes simultaneously. The simplest approach is to consider each axis independently, resulting in d attention matrices of size 4×4. This results in an attention mechanism where, for any given token, information is shared more densely among tokens that are closer in the quadtree representation, with the dilation factor increasing for tokens that are further away.
This is shown in the middle diagram of
The keys KX(j), queries QX(j) and values VX(j) are computed once for all tokens and shared among the different attention windows, while a different set of attention biases B(j,k) is learned for each. The final value for any given attention head is obtained by multiplying each attention tensor Ã(j,k) with the values tensor VX(j), then adding the resulting matrices together.
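A sketch of this per-axis attention with summed outputs is given below; the learned attention biases B(j,k) are omitted for brevity, and the tensors are assumed to already be in the quadtree layout described above.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def per_axis_attention(Q, K, V):
    """Attention applied independently along every size-4 quadtree axis of
    Q, K, V (shape (4, ..., 4, d)), giving one 4x4 attention matrix per axis;
    the per-axis outputs are summed to form the head output."""
    n_axes = Q.ndim - 1
    out = np.zeros_like(V)
    for axis in range(n_axes):
        q = np.moveaxis(Q, axis, -2)                       # (..., 4, d)
        k = np.moveaxis(K, axis, -2)
        v = np.moveaxis(V, axis, -2)
        A = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(Q.shape[-1]))
        out += np.moveaxis(A @ v, -2, axis)                # accumulate per-axis result
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 4, 4, 16)) for _ in range(3))
Y = per_axis_attention(Q, K, V)                            # same shape as V
```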
This corresponds to independently applying the same attentional mechanism over multiple dilation factors, instead of a single one computed over multiple scales. This can be solved by jointly computing the Softmax activation over the different attention tensors:
where d is the dimensionality of the quadtree representation, w is the number of axes within the attention window, and Â(j,k) are the attention matrices before applying Softmax. The resulting algorithm for the computation of a multi-scale attention head j is shown in Alg. 1, where ⊗ denotes the Einstein summation operator.
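A sketch of the jointly normalised variant, following our reading of the description above, is given below. It is written with plain matrix products rather than the Einstein summation operator of Alg. 1, and again omits the learned attention biases.

```python
import numpy as np

def joint_multi_scale_attention(Q, K, V):
    """Multi-scale attention over a quadtree tensor (4, ..., 4, d): the
    exponentiated per-axis logits share a single normalising denominator
    per target token, so the Softmax is computed jointly over all scales
    rather than independently per dilation factor."""
    n_axes, d = Q.ndim - 1, Q.shape[-1]
    logits = []
    for axis in range(n_axes):
        q = np.moveaxis(Q, axis, -2)
        k = np.moveaxis(K, axis, -2)
        logits.append(q @ np.swapaxes(k, -1, -2) / np.sqrt(d))   # (..., 4, 4)
    shift = max(l.max() for l in logits)       # one shared shift keeps the joint Softmax exact
    exps = [np.exp(l - shift) for l in logits]
    denom = 0.0
    for axis, e in enumerate(exps):
        denom = denom + np.moveaxis(e.sum(-1), -1, axis)         # per-token sum over all scales
    out = np.zeros_like(V)
    for axis, e in enumerate(exps):
        z = np.moveaxis(denom, axis, -1)[..., None]              # align denominator with logits
        v = np.moveaxis(V, axis, -2)
        out += np.moveaxis((e / z) @ v, -2, axis)
    return out
```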
As for the computational complexity, the attentional mechanism is slightly more costly than that of Swin. Given p the number of tokens, typically p=2^(2d) since in our approach the image size is ideally, but not limited to being, a power of 2 in both height and width, the number of axes of the quadtree representation equals d=log2(√p). This results in the following complexities:
Whereas MSA incurs a quadratic cost relative to the input size, with Windowed MSA being linear, Multi-Scale MSA has a slightly above linear cost that scales as p·log2(√p) relative to the number of tokens p. From
Where d is the number of axes of the quadtree representation and D is the input data dimensionality (D=2 for images, D=3 for volumes such as video frames). As the input size increases, that is, as d→∞, the above formula can be simplified to the following form:
This results in an overall redundancy of Red(2)=25% for 2d images, and Red(3)=12.5% in the 3d case given sufficiently large inputs. This overhead would further decrease as the dimensionality of the input data increases.
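For intuition, the asymptotic cost comparison made above can be evaluated numerically; constants are ignored and only the leading terms named in the text are computed.

```python
import numpy as np

for d in (4, 5, 6, 7):                        # quadtree depth
    p = 4 ** d                                # number of tokens, p = 2^(2d)
    full_msa = p ** 2                         # global MSA: quadratic in p
    windowed_msa = p * 7 ** 2                 # windowed MSA with a 7x7 window: linear in p
    multiscale_msa = p * np.log2(np.sqrt(p))  # proposed multi-scale MSA: p * log2(sqrt(p))
    print(d, p, full_msa, windowed_msa, multiscale_msa)
```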
In order to evaluate our approach, we compare it against the standard Swin-T architecture used in the original paper, Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", 2021, the contents of which are incorporated herein by reference. We use an equivalent model with the same number of layers, feature size and linear down-scaling of the input images as in the original work, resulting in an equivalent number of trainable parameters. The only differences, applied to both the proposed approach and the baseline Swin-T model, are the following:
The reason for the latter change is the quadtree image representation. In order to apply the transform seen in Eq. 3, we can resize the input images so that each spatial dimension is a power of 2.
The resizing can be done through bicubic interpolation of the pixel values or by cropping the image. However, the present disclosure is not limited to images of such sizes, and so the resizing step is not essential. In particular, the input image spatial dimensions (H, W) can be factored such that H=Q×2^n and W=R×2^n, with the factors of each spatial dimension being paired up in the quadtree representation.
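One possible way to obtain such a factorisation is sketched below (the helper name is our own): it extracts the largest power-of-two factor of each spatial dimension, leaving the odd remainders Q and R to form the coarsest level of the representation.

```python
def pow2_factor(x):
    """Split x into (q, n) with q odd, so that x = q * 2**n."""
    n = 0
    while x % 2 == 0:
        x, n = x // 2, n + 1
    return x, n

# e.g. an input of size 448 x 320 factors as 7*2**6 and 5*2**6,
# giving n = 6 paired quadtree levels plus a coarse 7 x 5 grid
print(pow2_factor(448), pow2_factor(320))   # (7, 6) (5, 6)
```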
We consider three different variations of our model. Swin-D applies a dilation factor of ×1 (no dilation) and ×2 to alternating layers, following the approach described in Sec. 3.3.
Since this allows us to increase the receptive field without down-sampling the tokens, we drop the window shifting performed by the baseline Swin model. Swinax-S applies per-axis attention as described in Sec. 3.3 for a sliding window of size w=2, while Swinax-N uses the same window size together with joint attention normalization. Attention window shifting is also dropped in both cases.
We consider three classification tasks. The first is a 103-class variation of ImageNet obtained from the commonly used ImageNet1K. The classes have been merged according to the WordNet lexical database. In order to construct it, we iteratively aggregate all classes belonging to the superclass with the smallest number of total samples. This dataset is used to perform an ablation study comparing our baseline Swin-D with the other three proposals.
In
When we consider multi-scale attention, we see that simply executing the attention mechanism on the collection A of axis subsets results in a drop in accuracy. On the other hand, if we perform a joint Softmax normalization instead of considering each scale independently, we obtain a slightly slower initial convergence rate, but eventually surpass all other considered approaches.
The other two datasets are ImageNet1K with the full 1000 classes, and Places365. For these two, we compare the baseline Swin-T model against Swinax-N, which we found to be the best performing model during the ablation study.
Referring to
where |(Di)j|=pk for j∈[1, . . . , m] and |F| corresponds to the feature length of the feature vector of each spatial location,
In
The CPU (processor) 101 is a system control unit and controls the entire apparatus 100. A program, stored in the HDD 104, may cause the apparatus 100 to execute a method of obtaining an attention matrix for use in a transformer-based model, according to any one of the above-mentioned examples.
In
The ROM 102 may store the said program and an operating system (OS) to be executed by the CPU 101. The RAM 103 provides memory for temporarily storing various types of information when the CPU 101 executes the program. The HDD 104 is a storage medium according to the present disclosure.
The display 105 (display unit) is a device for presenting a user interface (UI). The display 105 may include a touch sensor function. The keyboard 106 is one of input devices. For example, the keyboard 106 is used to input predetermined information to the UI displayed on the display 105. The mouse 107 is one of the input devices. For example, the mouse 107 is used to click on a button on the UI displayed on the display 105.
The data communication unit 108 (communication unit) is a device for communicating with external apparatuses.
The data bus 109 connects the foregoing units (102 to 108) with the CPU 101.
Some embodiments may be implemented as a recording medium including a computer-readable instruction such as a computer-executable program module. The computer-readable recording medium may be an arbitrary available medium accessible by a computer, and examples thereof include all volatile and non-volatile media and separable and non-separable media. Further, examples of the computer-readable recording medium may include a computer storage medium and a communication medium. Examples of the computer storage medium include all volatile and non-volatile media and separable and non-separable media, which have been implemented by an arbitrary method or technology, for storing information such as computer-readable instructions, data structures, program modules, and other data. The communication medium generally includes a computer-readable instruction, a data structure, a program module, other data of a modulated data signal, or another transmission mechanism, and an example thereof includes an arbitrary information transmission medium.
While the disclosure has been particularly shown and described with reference to embodiments thereof, it will be understood by one of ordinary skill in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as defined by the following claims. Hence, it will be understood that the embodiments described above are not limiting of the scope of the disclosure.
Although the method has been described as a method for image classification, it is to be understood that the method may relate to part or all of a method for image classification, image segmentation, and pixel-based analysis, where the remaining part of the method, if any, corresponds to conventional techniques.
The scope of the disclosure is indicated by the claims rather than by the detailed description of the disclosure, and it should be understood that the claims and all modifications or modified forms drawn from the concept of the claims are included in the scope of the disclosure.