Verification Of Perception Systems

Information

  • Patent Application
  • 20220019879
  • Publication Number
    20220019879
  • Date Filed
    November 26, 2019
  • Date Published
    January 20, 2022
Abstract
There is provided a computer-implemented method for verifying the robustness of a neural network classifier with respect to one or more parameterised transformations applied to an input, the classifier comprising one or more convolutional layers, the method comprising: encoding each layer of the classifier as one or more algebraic classifier constraints; encoding each transformation as one or more algebraic transformation constraints; encoding a change in an output classifier label from the classifier as an algebraic output constraint; determining whether a solution exists which satisfies the classifier constraints, transformation constraints and output constraints, and determining the classifier as robust to the local transformations if no such solution exists. A perception system and a computer readable medium are also provided.
Description
FIELD

The present disclosure relates to verifying the behaviour of classifiers. In particular, but not exclusively, the present disclosure relates to verification of the robustness of a trained convolutional neural network to transformations of its input, and improving such robustness. The disclosure also identifies an enhanced method to perform learning on the basis of the counterexamples found during verification.


BACKGROUND

Autonomous systems are forecasted to revolutionise key aspects of modern life including mobility, logistics, and beyond. While considerable progress has been made on the underlying technology, severe concerns remain about the safety of the autonomous systems under development.


One of the difficulties with forthcoming autonomous systems is that they incorporate complex components that are not programmed by engineers but are synthesised from data via machine learning methods, such as a convolutional neural network. Convolutional neural networks have been shown to be particularly sensitive to variations in their input. At the same time, there is an increasing trend to deploy autonomous systems comprising convolutional neural networks in safety-critical areas, such as autonomous vehicles. These two aspects taken together call for the development of rigorous methods for the formal verification of autonomous systems based on learning-enabled components.


At the moment, no existing technique can provide formal guarantees about the robustness of a convolutional neural network to those transformations of its input that are to be expected in its deployment. There is therefore no effective means of providing formal assurances on the real-world behaviour of autonomous systems in which the output of a convolutional neural network is used to inform decision-making.


SUMMARY

According to a first aspect, there is provided a computer-implemented method for verifying robustness of a neural network classifier with respect to one or more parameterised transformations applied to an input, the classifier comprising one or more convolutional layers.


The method comprises: encoding each layer of the classifier as one or more algebraic classifier constraints; encoding each transformation as one or more algebraic transformation constraints; encoding a change in an output classifier label from the classifier as an algebraic output constraint; determining whether a solution exists which satisfies the constraints above; and determining the classifier as robust to the transformations if no such solution exists.


In this way, a trained neural network classifier may be assessed as to whether there exist any potential parameters for a given set of transformations that would cause a change in classifier output. If so, the classifier can be understood to lack robustness to that transformation. Verifying that a classifier meets such a robustness criterion can be important, particularly in safety-critical implementations.
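
By way of illustration only, the sketch below poses this existence query for a hand-written three-class linear classifier and a single brightness-style parameter, using gurobipy (the Gurobi solver is referenced later in this description). The classifier weights, the parameter bounds and the variable names are all invented for the example; this is not the MILP encoding of a convolutional network described in the detailed description.

```python
import gurobipy as gp
from gurobipy import GRB

# Toy linear "classifier": scores_i = sum_k W[i][k] * x[k] + b[i]; label = argmax.
W = [[0.6, -0.2, 0.1], [-0.4, 0.5, 0.3], [0.1, 0.1, -0.6]]
b = [0.0, -0.1, 0.2]
x = [0.8, 0.2, 0.5]                                   # the given input
scores = [sum(W[i][k] * x[k] for k in range(3)) + b[i] for i in range(3)]
label = max(range(3), key=lambda i: scores[i])

# Brightness-style transformation x' = x + v with v in [-1, 1].
counterexample = None
for j in range(3):
    if j == label:
        continue
    m = gp.Model("robustness")
    m.Params.OutputFlag = 0
    v = m.addVar(lb=-1.0, ub=1.0, name="v")           # degree of freedom of t
    xt = [x[k] + v for k in range(3)]                 # transformation constraint
    s = lambda i: gp.quicksum(W[i][k] * xt[k] for k in range(3)) + b[i]  # classifier
    m.addConstr(s(j) >= s(label) + 1e-6)              # output-change constraint
    m.optimize()
    if m.Status == GRB.OPTIMAL:                       # a solution exists: not robust
        counterexample = v.X
        break
print("robust" if counterexample is None else f"not robust, e.g. v = {counterexample:.3f}")
```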


For example, the classifier may be configured to classify sensor data such as image data and/or audio data (for example, the classifier may be an image classifier and/or an audio classifier). The classifier may operate as part of a perception system configured to take one or more actions in dependence on the output of the classifier. The perception system may comprise, for example, the classifier and a controller, wherein the controller is configured to output one or more control signals in dependence on the output of the classifier. The perception system may further comprise an actuator configured to operate in accordance with control signals received from the controller. In such circumstances, the reliability of these actions may be compromised when the perception input is misclassified.


When a solution exists to the constraint problem identified above on the classifier, transformation and output constraints, the method may identify the parameters of the one or more transformations associated with the solution. This enables the construction of a counterexample to the classifier, which can be used as evidence in safety-critical analysis. Furthermore, it can be used to augment the dataset and retrain the classifier to improve the robustness of the classifier.


In some embodiments, generating the additional training data may comprise applying the one or more transformations to existing training data using the identified parameters.


Optionally, one or more of the classifier constraints, the transformation constraints and the output constraints are linear constraints. All constraints may be linear. Moreover, the constraints may comprise equality and/or inequality constraints.


At least one of the transformations may be a geometric transformation, such as a translation, rotation, scaling or shear. The transformations may additionally or alternatively comprise photometric transformations, such as brightness and contrast changes. The transformations may be local transformations; for example, the transformations may be element-wise transformations.


Optionally, the classifier may comprise one or more fully connected as well as convolutional layers. The fully connected layers and/or the convolutional layers may comprise rectified linear unit (ReLU) activation functions. The convolutional layers may comprise a pooling function such as a max-pooling function.


Optionally, encoding each layer of the classifier as one or more algebraic classifier constraints comprises deriving a mixed-integer linear programming expression for each layer. Indeed, the classifier, transformation and output constraints may all be expressed as a mixed-integer linear programming expression. The skilled person will recognise that solvers are available for such expressions that can efficiently determine whether a solution exists. Other linear inequality and equality constraint representations may be adopted where appropriate.


Optionally, the method may further comprise encoding one or more pixel perturbation constraints, and determining whether the solution exists may comprise determining whether the solution meets the perturbation constraints as well as the other constraints identified above.


According to a further aspect, there may be provided a computer program product comprising computer executable instructions which, when executed by one or more processors, cause the one or more processors to carry out the method of the first aspect. There may also be provided an implementation comprising one or more processors configured to carry out the method of the first aspect.





BRIEF DESCRIPTION OF THE FIGURES

Preferred embodiments of the present disclosure are described below with reference to the accompanying drawings, in which:



FIG. 1 illustrates an example system capable of formally verifying the robustness of a trained convolutional neural network to transformations of its input;



FIG. 2 illustrates a convolutional neural network;



FIG. 3 illustrates the effect of various transformations;



FIG. 4 shows an exemplary process for formally verifying the robustness of a trained convolutional neural network to transformations of its input implemented by the system of FIG. 1;



FIG. 5 shows experimental results demonstrating the efficacy of the proposed method; and



FIG. 6 illustrates further results demonstrating the efficacy of the proposed method.





DETAILED DESCRIPTION


FIG. 1 illustrates an example system capable of formally verifying the robustness of a trained convolutional neural network to transformations of its input. Such a system comprises at least one processor 102, which may receive data from at least one input 104 and provide data to at least one output 106.


The concepts of a convolutional neural network, a transformation, and local transformational robustness are now described with reference to FIGS. 2 and 3.


With reference to FIG. 2, a convolutional neural network (CNN) 200 is a directed acyclic graph structured in layers 210-230, such that each node of the graph belongs to exactly one layer. The first layer is said to be the input layer (not shown), the last layer is referred to as the output layer 230, and every layer in between is called a hidden layer.


The CNN takes in data as input 202 such as an image or an audio signal, and outputs a label 204 that can take one of a plurality of possible output classes.


The nodes in the input layer of the CNN reflect the input to the CNN. In all layers except the input layer, each node is connected to one or more nodes of the preceding layer, where each connection is associated with one or more weights.


Every layer in the CNN apart from the input layer is either a fully-connected layer 220 or a convolutional layer 210.


In a fully connected layer 220, each node is connected to every node in the preceding layer, and operates as follows. First, each node calculates the weighted sum 222 of its inputs according to the connection weights, to obtain a linear activation. Second, each node applies a non-linear activation function 224 to the linear activation to obtain its output. Typically, the non-linear activation function 224 is the Rectified Linear Unit (ReLU) whose output is the maximum between 0 and the linear activation, but may alternatively be any non-linear function, such as a logistic function, a tanh function, or a different piecewise-linear function. Where the activation function 224 is a piecewise-linear function, the function implemented by the layer may be readily expressed as a set of linear equality and inequality constraints.
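
A plain numpy illustration of such a fully connected ReLU layer (a generic sketch, not code from the disclosure) is:

```python
import numpy as np

def fully_connected_relu(x, W, b):
    """Fully connected layer: weighted sum followed by element-wise ReLU."""
    z = W @ x + b              # linear activation (weighted sum plus bias)
    return np.maximum(0.0, z)  # ReLU applied element-wise

# Example: 3 inputs -> 2 outputs, with arbitrary weights.
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.2, -0.4, 0.1],
              [0.7, 0.3, -0.5]])
b = np.array([0.05, -0.1])
print(fully_connected_relu(x, W, b))
```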


In a convolutional layer 210, each node is connected to a rectangular neighbourhood of nodes in the preceding layer, and operates as follows. First, each node calculates one or more weighted sums of its inputs 212 according to the connection weights, to obtain one or more linear activations. Second, each node applies a non-linear activation function 214 to each of the one or more linear activations to obtain one or more activations. Typically, the non-linear activation function 214 is a ReLU, but may alternatively be any non-linear function, such as a logistic function or a different piecewise-linear function. Third, each node applies a pooling function 216 to the activations, collapsing the activations into a representative activation, which is the node's output. Typically, the pooling function 216 is the max-pooling function, which sets the representative activation to the maximum of the one or more activations, but could alternatively be another function such as the weighted sum of the one or more activations. Where the non-linear activation function 214 and the pooling function 216 are piecewise-linear functions, the function implemented by the layer may be readily expressed as a set of linear equality and inequality constraints.
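
A similar plain numpy sketch of a single-kernel convolutional layer with a ReLU activation and 2×2 max-pooling follows; the kernel, bias and pool size are arbitrary illustrative choices:

```python
import numpy as np

def conv_relu_maxpool(x, K, b, pool=2):
    """Single-kernel convolutional layer: valid convolution, ReLU, then max-pooling."""
    p, q = K.shape
    H, W = x.shape
    conv = np.empty((H - p + 1, W - q + 1))
    for u in range(conv.shape[0]):
        for v in range(conv.shape[1]):
            conv[u, v] = np.sum(K * x[u:u + p, v:v + q]) + b   # linear activation
    act = np.maximum(0.0, conv)                                 # ReLU
    Hp, Wp = act.shape[0] // pool, act.shape[1] // pool
    pooled = act[:Hp * pool, :Wp * pool].reshape(Hp, pool, Wp, pool).max(axis=(1, 3))
    return pooled

x = np.arange(36, dtype=float).reshape(6, 6) / 36.0
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv_relu_maxpool(x, K, b=0.1))
```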


The output layer 230 is a fully connected layer 220 that comprises precisely one node for each output class and whose activation function 234 outputs a label corresponding to the node with the largest linear activation.


In general, a CNN is adapted to solving the task of inferring a class from an input. Typically, a CNN may be used to solve an image classification task, where the input 202 is an image from an unknown distribution over R^{α·β·γ} (where α·β are the pixels of the image and γ is the number of colour bands, e.g. RGB). The task then concerns the approximation of an unspecified function ƒ*: R^{α·β·γ} → {1, . . . , c} that takes as input the image and determines, among a set of classes {1, . . . , c}, the class of which the image is a member. In another example, a CNN may be used to recognise a phoneme from an input representing an audio signal.


The task is solved by training a CNN by means of a training set comprising a plurality of training examples, each of which comprises an input 202 and its associated class. Training the CNN means setting the weights of the network so that its output 204 approximates ƒ*. Following this, the CNN can be used to infer the class 204 for a new input 202 by feeding the new input 202 to the input layer and then propagating it through the network.


To fix the notation, the set {1, . . . , n} is denoted by [n], and {n1, . . . , n2} by [n1, n2]. A CNN, denoted CNN, with a set [n] of layers is considered. The nodes in a convolutional layer are arranged into a three-dimensional array; interchangeably, this arrangement may be treated as reshaped into a vector. The nodes in a fully connected layer are arranged into a vector. The output of the (j, k, r)-th (j-th, respectively) node in a convolutional (fully connected, respectively) layer i is represented by x_{j,k,r}^{(i)} (x_j^{(i)}, respectively). The vector of all the nodes' outputs in layer i is denoted x^{(i)}. The size of layer i is denoted s^{(i)}, and the size of the j-th dimension of convolutional layer i is denoted s_j^{(i)}.


Every layer 2≤i≤n is a function ƒ^{(i)}: R^{s^{(i−1)}} → R^{s^{(i)}}.


If layer i is a fully-connected layer 220, the function ƒ^{(i)} is defined as follows. The layer is associated with a weight matrix W^{(i)}∈R^{s^{(i)}·s^{(i−1)}} and a bias vector b^{(i)}∈R^{s^{(i)}}. The linear activation of the layer is given by the weighted sum WS(x^{(i−1)}) = W^{(i)}·x^{(i−1)} + b^{(i)}. The function computed by the layer can then be defined as ƒ^{(i)}(x^{(i−1)}) = □(WS(x^{(i−1)})), where □∈{ReLU, Argmax} and the function ReLU(x) = max(0, x) is applied element-wise to the linear activation.


If layer i is a convolutional layer 210, the function ƒ^{(i)} is defined as follows. The layer is associated with a group Conv^{(i),1} . . . Conv^{(i),k} of k≥1 convolutions and a max-pooling function Pool^{(i)}. Each convolution Conv^{(i),j}: R^{s^{(i−1)}} → R^{(s_1^{(i−1)}−p+1)·(s_2^{(i−1)}−q+1)} is parameterised over a weight matrix (kernel) K^{(i),j}∈R^{p·q·s_3^{(i−1)}}, where p≤s_1^{(i−1)} and q≤s_2^{(i−1)}, and a bias vector b_j^{(i)}. The (u,v)-th output of the j-th convolution is given by














Conv_{u,v}^{(i),j}(x^{(i−1)}) = K^{(i),j} · x^{(i−1)}_{[u,u′],[v,v′]} + b_j^{(i)},







where u′=u+p−1 and v′=v+q−1. Given the outputs of each of the convolutions, the linear activation of the layer Conv^{(i)}: R^{s^{(i−1)}} → R^{(s_1^{(i−1)}−p+1)·(s_2^{(i−1)}−q+1)·k} forms a three-dimensional matrix, i.e. Conv^{(i)} = [Conv^{(i),1} . . . Conv^{(i),k}]. The non-linear activation of the layer is then computed by the application of the ReLU function. Finally, the max-pooling function collapses neighbourhoods of size p′·q′ of the latter activations to their maximum values. Formally,













Pool^{(i)}: R^{u·v·r} → R^{(u/p′)·(v/q′)·r},







where u=(s_1^{(i−1)}−p+1) and v=(s_2^{(i−1)}−q+1), is defined as follows: Pool_{u,v,r}^{(i)} = max(Conv^{(i)}_{[(u−1)p′+1, u·p′],[(v−1)q′+1, v·q′], r}). The function computed by the layer is then defined by ƒ^{(i)}(x^{(i−1)}) = Pool^{(i)}(ReLU(Conv^{(i)}(x^{(i−1)}))). Given the above, a convolutional neural network 200 CNN: R^{α·β·γ} → [c] can be defined as the composition of fully connected layers 220 and convolutional layers 210: CNN(x) = ƒ^{(n)}(ƒ^{(n−1)}( . . . ƒ^{(1)}(x) . . . ))


where x∈Rα·β·γ, ƒ(1) . . . ƒ(n−1) are ReLU fully connected or convolutional layers, ƒ(n) is an Argmax fully connected layer, and [c] is a set of class labels.


A transformation is a parametrised function t that transforms a possible input to the CNN into another input according to a predetermined rule. The parameters of the transformation are named degrees of freedom and denoted by the tuple dof(t). For example, if t is a translation that shifts an image by t_x pixels in the horizontal direction and t_y pixels in the vertical direction, then dof(t)=(t_x, t_y). The set of possible values for the degrees of freedom is called the domain D⊆R|dof(t)| of the transformation. The domain D may be a strict subset of R|dof(t)|, if it is desired to restrict the transformation to certain parameter values only. For example, it may only be necessary to ascertain that a CNN used to classify images is robust to certain small translations rather than all translations. Given d∈D, we denote by t[d] the concretisation of t whereby every parameter dof(t)i is set to di.
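
For instance, a translation with dof(t) = (t_x, t_y) and a small finite domain D can be written in a few lines of numpy; this is an illustrative sketch, and the zero-padding boundary convention is an assumption, not fixed by the disclosure:

```python
import numpy as np

def translate(image, d):
    """Concretisation t[d] of a translation with dof(t) = (t_x, t_y).

    Pixels shifted outside the image are dropped; vacated pixels are set to zero
    (an illustrative boundary convention only).
    """
    t_x, t_y = d
    out = np.zeros_like(image)
    H, W = image.shape
    for y in range(H):
        for x in range(W):
            yy, xx = y + t_y, x + t_x
            if 0 <= yy < H and 0 <= xx < W:
                out[yy, xx] = image[y, x]
    return out

# Domain D restricted to small shifts only, e.g. D = {-1, 0, 1} x {-1, 0, 1}.
D = [(t_x, t_y) for t_x in (-1, 0, 1) for t_y in (-1, 0, 1)]
img = np.eye(4)
transformed = [translate(img, d) for d in D]   # the set of transformed inputs
```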


Typically, the domain D will be a linearly definable set, that is, that it is definable by linear equality and/or inequality constraints. For example, a simple range of values is a linearly definable set. If D is not linearly definable, it may be approximated to arbitrary precision by a linearly definable set.


Given an input, a transformation concisely describes the set of transformed inputs obtained by applying the transformation to the input. This set of transformed inputs may contain an extremely large number of elements, such that it may be computationally infeasible for them all to be explicitly constructed; however, this set is completely and concisely expressed by the input, the form of the transformation, and its domain.


An instance of transformation on an image is an affine transformation, which transforms the pixel at location (x,y) of the original image into the pixel at location (x′, y′) of the transformed image according to the following formula:







(x′, y′, 1)ᵀ = [[a11, a12, tx], [a21, a22, ty], [0, 0, 1]] · (x, y, 1)ᵀ

where

A = [[a11, a12], [a21, a22]]





is a non-singular matrix, and t=(tx, ty)T is a translation vector. An affine transformation whereby A equals the identity matrix is said to be a translation. An affine transformation is referred to as scaling if A=σI, tx=0 and ty=0 for a scale factor σ. In the case where σ<1, the scaling is called subsampling, whereas in the case where σ>1 the scaling is known as zooming.


Another instance of transformation on an image is a photometric transformation, which is an affine change in the intensity of the pixels, applied identically to all the pixels in an image. It is defined as t(p) = μ·p + v. If 0<μ<1, then the transformation reduces the contrast of the image. Otherwise, if μ>1, the transformation increases the contrast of the image. The factor v controls the luminosity of the image, with higher values pertaining to brighter images.
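
A direct numpy rendering of this definition (the clipping to [0, 1] is an illustrative display convention, not part of the definition) is:

```python
import numpy as np

def photometric(image, mu, v):
    """Photometric transformation t(p) = mu * p + v applied to every pixel.

    mu in (0, 1) reduces contrast, mu > 1 increases it; v shifts luminosity.
    """
    return np.clip(mu * image + v, 0.0, 1.0)

img = np.random.default_rng(0).uniform(size=(8, 8))
brighter = photometric(img, mu=1.0, v=0.2)          # brightness change
higher_contrast = photometric(img, mu=1.5, v=0.0)   # contrast change
```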



FIG. 3 demonstrates instances of transformations that may be applied to an original image 301. The images 302-304 are each obtained by applying an affine transformation: a translation, a subsampling, and a zooming, respectively. The images 305 and 306 are each obtained by applying a photometric transformation: a brightness change and a contrast change, respectively. The random noise transformation of image 310 may represent a further parameterised transformation. It is understood that each of these example images is obtained at specific values of the degrees of freedom of the transformation; however, the transformations themselves may be defined over non-singular domains.


The examples of FIG. 3 illustrate the limitations associated with some earlier techniques for analysing the performance of CNN classifiers. In particular, some earlier techniques check whether, given an image I, all images I′ with ∥I′−I∥_p ≤ δ, for some L_p-norm, are classified as belonging to the same class as I. This technique does not adequately reflect the nature of the potential variations in the sensor data input to perception systems. For example, with reference to FIG. 3, the distance under such a norm between the original image 301 and the brightened image 305 is ˜17.16, whereas the distance between the original image 301 and the image composed with random noise 310 is ˜16.02. However, it is intuitively clear that the brightened image should be classified as the original one, whereas the classification of the noisy one is not as clear. This disconnect between the similarity of images as perceived by a human and the similarity as measured by such a norm indicates the limitations of approaches based on such measures.


Where the output of the CNN is used to drive decisions, it is often a requirement that the CNN be robust to particular transformations of its input. This requirement may be formalised as the notion of local transformational robustness, described as follows. Given a transformation t with domain D⊆R^{|dof(t)|}, a convolutional neural network CNN is said to be locally transformationally robust (LTR) for an image I and the transformation t if, for all d∈D, we have that CNN(t[d](I)) = CNN(I).
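
Before turning to the formal encoding, a simple sampling-based check can falsify, but never prove, local transformational robustness. The sketch below (plain Python, illustrative only) makes the definition concrete; only the constraint-based verification described in the following provides an actual guarantee:

```python
import numpy as np

def sampled_ltr_check(cnn, image, transform, domain_samples):
    """Empirical (sampling-based) check of local transformational robustness.

    Returns False as soon as some sampled d in D changes the label. Passing the
    check does NOT prove robustness.
    """
    label = cnn(image)
    for d in domain_samples:
        if cnn(transform(image, d)) != label:
            return False
    return True

# Tiny demo with a thresholding "classifier" and a brightness shift.
cnn = lambda im: int(im.mean() > 0.5)
transform = lambda im, d: np.clip(im + d, 0.0, 1.0)
img = np.full((4, 4), 0.45)
print(sampled_ltr_check(cnn, img, transform, domain_samples=np.linspace(-0.1, 0.1, 21)))
```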


The particular transformations are typically chosen to reflect transformations of the input that are expected to occur in a practical deployment. Evaluating the robustness of the CNN to those transformations allows the suitability of the CNN for the practical deployment to be ascertained.


For example, the CNN may be an image classifier, which takes an image as input and outputs a class label. This CNN may be used in a practical deployment where it needs to always return the same label for an image of a given visual scene, regardless of the wavelength sensitivity of the camera, the angle and position of the camera, the camera's resolution, or the distance between the camera and the scene. This requirement may then be expressed in terms of robustness to certain transformations, such as affine and photometric transformations. Establishing that the CNN is LTR to those transformations therefore provides assurance that it will function robustly in its practical deployment.


In another example, the CNN may be a speech recognition classifier, which takes an audio signal as input and outputs a phoneme. In this case, establishing that the CNN is LTR to certain frequency-dependent transformations could validate that the CNN will always return the same label for an audio signal of a given utterance, regardless of the frequency response of the microphone.


Turning to FIG. 4, an example method 400 is described, capable of evaluating whether a convolutional neural network is locally transformationally robust for a given transformation and an input. The method 400 comprises steps 402-422. In the present example, the CNN is an image classifier, the input is an image, and the transformation is an arbitrary combination of a translation, a scaling, and a photometric transformation.


At step 402, the processor 102 is given a trained CNN and a sequence of one or more transformations t(1), . . . , t(k); the method will then proceed to evaluate the local transformational robustness of the CNN to the transformation obtained by composing the transformations t(1), . . . , t(k). The CNN may be specified, for example, by its architecture and weights. The sequence of one or more transformations may be specified, for example, by their forms t(1), . . . t(k) and their domains D(1), . . . , D(k). For example, the sequence of one or more transformations may be a translation, a scaling, and a photometric transformation, applied in that order, and a form and a range of parameters may be specified for each of the translation, the scaling, and the photometric transformation.


At step 404, a transformational CNN is constructed from the CNN and the sequence of one or more transformations. The transformational CNN, denoted CNN_t, is constructed by treating the one or more transformations as additional layers and appending them to the input layer.


In some embodiments, a perturbation layer may also be added between the CNN layers and the additional layers corresponding to the one or more transformations to construct the transformational CNN. A perturbation layer is a layer that simulates a small perturbation of each pixel, up to a given constant ρ. In this way, robustness of the CNN to a combination of the transformations and of the small perturbations may be established. Since such small perturbations may commonly result from pixel interpolations occurring in image compression, transmission and encoding, verifying that the CNN is robust to such small perturbations may provide assurance that the practical deployment will function correctly despite differences in image compression, transmission and encoding setups.


Thus, the transformational CNN is constructed as CNN_t(x) = ƒ^{(n+k+2)}( . . . ƒ^{(1)}(x) . . . ), where ƒ^{(n+k+2)}, . . . , ƒ^{(k+2)} are the original layers of the CNN, ƒ^{(k+1)} may optionally be a perturbation layer, and ƒ^{(k)}, . . . , ƒ^{(1)} = t^{(k)}, . . . , t^{(1)}.


At step 406, the transformational CNN CNN_t is encoded into a set of equality and inequality constraints. This is achieved by first expressing the function performed by each layer i of the transformational CNN as equality and inequality constraints C^{(i)}, and then aggregating the equality and inequality constraints for all the layers into a single set C(CNN_t) = C^{(1)}∪ . . . ∪C^{(n)}.


In the present example, for each layer i of the transformational CNN CNN_t, the equality and inequality constraints C^{(i)} are a mixed-integer linear problem representation (MILP representation) of the layer. A MILP representation of a function ƒ^{(i)} is a set C^{(i)} of linear equality and inequality constraints on real-valued and integer variables that completely characterises the function ƒ^{(i)}. All transformations and layer types considered in the present example—translation, scaling, photometric transformation, fully-connected layers with a ReLU activation function, and convolutional layers with a ReLU activation function and max-pooling—have a MILP representation. It will be evident to the person skilled in the art that a layer may alternatively be characterised by a set of linear equality and inequality constraints C^{(i)} which is not a MILP representation of the layer, or even as a set of equality and inequality constraints C^{(i)} which are not necessarily linear.


Where the equality and inequality constraints are linear, this has the advantage that very efficient dedicated solvers for linear equality and inequality constraints may be leveraged to determine local transformational robustness, such as MILP, SAT, CSP or SMT solvers. However, it is not essential that the equality and inequality constraints be linear, since there also exist very efficient dedicated solvers for problems that involve a mix of linear and non-linear equality and inequality constraints, such as quadratic programming and convex programming solvers.


In the following, the MILP representations of a translation, a scaling, a photometric transformation, a fully-connected layer with a ReLU activation function, and a convolutional layer with a ReLU activation function and max-pooling, are described.


The MILP representation of a photometric transformation is described as follows. A photometric transformation has two degrees of freedom: the factor μ that handles the contrast of the image, and the factor v which controls the luminosity of the image. The instantiations of the photometric transformation—that is, the possible values for the degrees of freedom of the transformation—may then be expressed by the following linear constraints:





λ_d ≥ min_d(D^{(i)}) and λ_d ≤ max_d(D^{(i)}), for each d ∈ dof(t^{(i)})  (C1)


where each λ_d is a newly-introduced variable controlling the values of the factor d, and min_d(D^{(i)}) (max_d(D^{(i)}), respectively) denotes the minimum (maximum, respectively) value for the factor d in D^{(i)}.


The photometric transformation itself for each pixel px is encoded in the following constraint:






t
[px]
i(x(i−1))=μ·xpx(i−1)+v  (C2)

    • Given this, a photometric transformation t(i) may thus be described by its MILP representation C(i)=C1 ∪C2.


The MILP representation of an affine transformation is now described. For any affine transformation, a set of constraints of the form C1 is used, capturing the set of instantiations of the affine transformation. Also, for every instantiation d, a binary variable δd(i) is introduced. The variable represents whether the corresponding instantiation is the one being applied. The fact that exactly one instantiation is in use at any one time is imposed using the constraint





Σ_{d∈D^{(i)}} δ_d^{(i)} = 1  (C3)


A bijection is also forced between the set of δ variables and the instantiations they represent by assuming:





Σ_{d∈D^{(i)}} d_j·δ_d^{(i)} = λ_j, for each d.o.f. j in t^{(i)}  (C4)


So δd(i)=1 if, for each d.o.f. j, the LP variable λj representing j equals dj.
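
As an illustration, constraints of the forms C1, C3 and C4 for a translation with a small finite domain might be written in gurobipy as follows; the variable names are illustrative and a working gurobipy installation is assumed:

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("instantiation")
D = [(tx, ty) for tx in (-1, 0, 1) for ty in (-1, 0, 1)]   # finite domain D^(i)

# C1: one LP variable per degree of freedom, bounded by the domain.
lam_x = m.addVar(lb=-1, ub=1, name="lambda_tx")
lam_y = m.addVar(lb=-1, ub=1, name="lambda_ty")

# C3: one binary variable per instantiation; exactly one is selected.
delta = {d: m.addVar(vtype=GRB.BINARY, name=f"delta_{d[0]}_{d[1]}") for d in D}
m.addConstr(gp.quicksum(delta.values()) == 1)

# C4: the selected instantiation fixes the value of each degree of freedom.
m.addConstr(gp.quicksum(d[0] * delta[d] for d in D) == lam_x)
m.addConstr(gp.quicksum(d[1] * delta[d] for d in D) == lam_y)
```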


The MILP representations of specific affine transformations, namely translation, subsampling, and zooming, are now described.


First, the MILP representation of a translation is described. A translation shifts the location of every pixel as per the translation vector (u′, v′):


















t^{(i)}_{u,v,r}(x^{(i−1)}) = Σ_{(u′,v′)∈D^{(i)}} x^{(i−1)}_{u+u′, v+v′, r} · δ^{(i)}_{u′,v′}  (C5)







Therefore, a translation t(i) is described by its MILP representation C(i)=C1∪C3∪C4∪C5.


Second, the MILP representation of a subsampling is described. A subsampling collapses neighbourhoods of points to a point whose value is a statistic approximation of the neighbourhood. In the present example, the statistic approximation used is the arithmetic mean value. The size of the neighbourhood is controlled by the scaling factor d. This requirement is expressed as the following linear equality constraint:






t^{(i)}_{u,v,r}(x^{(i−1)}) = Σ_{d∈D^{(i)}} AM(x^{(i−1)}_{[(u−1)d+1, u·d],[(v−1)d+1, v·d], r}) · δ_d^{(i)}   (C6)


It follows that a subsampling t^{(i)} is described by its MILP representation C^{(i)} = C1∪C3∪C4∪C6.


Third, the MILP representation of a zooming is described. A zooming replicates the value of a pixel to a rectangular neighbourhood of pixels. The size of the neighbourhood is controlled by the scaling factor d.

















t^{(i)}_{u,v,r}(x^{(i−1)}) = Σ_{d∈D^{(i)}} x^{(i−1)}_{⌈u/d⌉, ⌈v/d⌉, r} · δ_d^{(i)}   (C7)







Therefore, a zooming t(i) is described by its MILP representation C(i)=C1∪C3∪C4 ∪C7. For the case of an arbitrary combination of a translation, a scaling, a shear, and a rotation, an alternative description of step 406 is given which uses non-linear constraints. In the present example, the composition of the transformations has a Mixed Integer Non-Linear Programming (MINLP) representation. MINLP allows for the description of the inverse of the matrix of the composition of the geometric transformations. Therefore, MINLP allows for the description of the composition of the geometric transformations with interpolation. Interpolation enables the discrete pixel representation of the application of the transformations to a given image. The present example composes nearest-neighbour interpolation with the transformations. It will be evident to the person skilled in the art that alternative interpolation methods can also be used, such as bi-linear interpolation.


In the following, the MINLP representation of the composition of the transformations with nearest-neighbour interpolation is described. To enable the representation, shears are restricted along the x- or y-axis and rotations are linearly approximated. For a sequence of one or more transformations t(1), . . . , t(k) with domains D(1), . . . , D(k), the representation is a set of MINLP constraints expressing the composition of t(1), . . . , t(k) with nearest-neighbour interpolation. This is achieved in three steps. The first step inverts the matrix of the composition of the geometric transformations. For each pixel p′ of the transformed image, the inverted matrix is used to determine the pixel (M1 . . . Mk)−1 p′ from the original image from which p′ should obtain its value (where each Mi is the matrix of transformation t(i)). The second step identifies the nearest pixel from the input image to (M1 . . . Mk)−1 p′. The third step assigns the value of the nearest pixel from step 2 to p′.
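
The pixel-level operation that these three steps encode symbolically can be illustrated concretely in numpy (an illustrative implementation of inverse mapping with nearest-neighbour interpolation, not part of the MINLP encoding itself):

```python
import numpy as np

def warp_nearest(image, Ms):
    """Apply the composition of 3x3 homogeneous affine matrices Ms to an image
    using inverse mapping with nearest-neighbour interpolation."""
    H, W = image.shape
    M = np.linalg.multi_dot(Ms) if len(Ms) > 1 else Ms[0]
    M_inv = np.linalg.inv(M)                       # step 1: invert the composition
    out = np.zeros_like(image)
    for y in range(H):
        for x in range(W):
            src = M_inv @ np.array([x, y, 1.0])    # source location for pixel p'
            sx, sy = int(round(src[0])), int(round(src[1]))  # step 2: nearest pixel
            if 0 <= sx < W and 0 <= sy < H:
                out[y, x] = image[sy, sx]          # step 3: assign its value
    return out

# Example: compose a small translation with a 2x zoom.
T = np.array([[1, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=float)
Z = np.array([[2, 0, 0], [0, 2, 0], [0, 0, 1]], dtype=float)
img = np.eye(6)
print(warp_nearest(img, [T, Z]))
```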


The MINLP representation of the first step is described as follows. The step constructs a set of constraints representing the inverted matrix of the composition of the transformations. The constraints are defined by the following:









{ λ_d^{(i)} ≥ min_d(D^{(i)}), λ_d^{(i)} ≤ max_d(D^{(i)}) : i ∈ [k], d ∈ dof(D^{(i)}) }   (C′1)

l_1 = M_k^{−1}, l_2 = M_{k−1}^{−1}·l_1, . . . , l_k = M_1^{−1}·l_{k−1}   (C′2)







where each λ_d^{(i)} is an MINLP variable expressing the possible instantiations of factor d of the transformation t^{(i)}, and l_1, l_2, . . . , l_k are matrices of MINLP variables expressing M_k^{−1}, M_{k−1}^{−1}M_k^{−1}, . . . , M_1^{−1} · · · M_k^{−1}, respectively. The values of the variables l_1, l_2, . . . , l_k are non-linearly derived in C′2; therefore the C′2 constraints are not expressible in MILP.


The MINLP representation of the second step is described as follows. The step extends the MINLP program constructed by the first step to encode the nearest neighbour of (M_1 . . . M_k)^{−1} p′. To do this, it first builds a set of MINLP constraints representing the distance between (M_1 . . . M_k)^{−1} p′ and each of the points of the input image as measured by the L1 norm. The constraints it generates are given by the following: Dist(l_k·p′, p) = ∥l_k·p′ − p∥_1, for each pixel p   (C′3). The L1 norm is a piecewise-linear function and can therefore be encoded in MINLP by means of the big-M method. Following the construction of C′3, the second step builds a set of constraints to identify the point p such that ∥l_k·p′ − p∥_1 is minimal. This is expressed by the following:





mindist = min(Dist(l_k·p′, p), for each pixel p)  (C′4)


The minimum function is a piecewise-linear function and can thus be expressed in MINLP by using the big-M method.


The MINLP representation of the third step is described as follows. The step takes as input the constraints from the second step and the image under question. It then constructs a set of constraints that encode the assignment of the value of the nearest neighbour of l_k·p′ to p′. The constraints are defined as follows:











δ_p = 1 → Dist(l_k·p′, p) = mindist, for each pixel p   (C′5)

Σ_p δ_p = 1   (C′6)

val(p′) = Σ_p val(p)·δ_p   (C′7)







The above constraints use a binary variable δ_p per pixel p. It is required by C′5 that if a variable is equal to 1, then the pixel associated with the variable is the nearest neighbour to l_k·p′. The implication constraints in C′5 are expressible in MINLP through the big-M method. The constraint C′6 insists that exactly one of the binary variables equals 1. Therefore, by C′7, p′ is assigned the value of the nearest neighbour of l_k·p′.


Therefore, the composition of t(1), . . . , t(k) is alternatively described by its MINLP representation C′1 ∪ . . . ∪C′7. Differently from the MILP representation, the MINLP representation composes t(1), . . . , t(k) with interpolation.


The MILP representation of a perturbation layer ƒ^{(i)} is given as follows. For each pixel px, the variation of the pixel between the input and the output of the perturbation layer must be at most ρ in magnitude. Therefore, the perturbation layer may be expressed by two constraints for each pixel px:







x_{px}^{(i+1)} − x_{px}^{(i)} ≤ ρ  (C8)

x_{px}^{(i+1)} − x_{px}^{(i)} ≥ −ρ  (C9)


A perturbation layer ƒ(i) is thus described by its MILP representation C(i)=C8∪C9. The MILP representation of a fully-connected layer is now described. The weighted sum function is encoded as the following constraint:






WS(x^{(i−1)})_j = Σ_k W_{j,k}^{(i)}·x_k^{(i−1)} + b_j^{(i)}  (C10)


To capture the piecewise-linearity of the ReLU function, a binary variable δj(i) is introduced for each node j that represents whether the output of the node is above 0. The ReLU may therefore be expressed as the following inequality constraints:





ReLU(x^{(i−1)})_j ≥ 0  (C11)

ReLU(x^{(i−1)})_j ≥ WS(x^{(i−1)})_j  (C12)

ReLU(x^{(i−1)})_j ≤ WS(x^{(i−1)})_j + M·δ_j^{(i)}  (C13)

ReLU(x^{(i−1)})_j ≤ M·(1 − δ_j^{(i)})  (C14)


In the above inequalities, M denotes a sufficiently large number.


Therefore, a fully connected layer ƒ(i) is described by its MILP representation C(i)=C10∪ . . . ∪C14.
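
A gurobipy sketch of this big-M encoding for a single ReLU node follows; the bound M, the weights and the function name are illustrative, and M is assumed to bound the magnitude of the node's linear activation:

```python
import gurobipy as gp
from gurobipy import GRB

def encode_relu_node(m, x_prev, w, b, M=1e3):
    """Big-M encoding of one ReLU node (in the spirit of constraints C10-C14).

    x_prev: list of gurobipy variables holding the previous layer's outputs.
    Returns the variable holding the node's output.
    """
    ws = gp.quicksum(w[k] * x_prev[k] for k in range(len(x_prev))) + b  # C10
    y = m.addVar(lb=0.0, name="relu_out")        # C11: output is non-negative
    d = m.addVar(vtype=GRB.BINARY)               # indicator for the inactive phase
    m.addConstr(y >= ws)                          # C12
    m.addConstr(y <= ws + M * d)                  # C13
    m.addConstr(y <= M * (1 - d))                 # C14
    return y

# Usage: encode a 2-input node whose inputs are already MILP variables.
m = gp.Model("relu")
x_prev = [m.addVar(lb=-1, ub=1) for _ in range(2)]
y = encode_relu_node(m, x_prev, w=[0.5, -0.3], b=0.1)
```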


The MILP representation of a convolutional layer is as follows. In addition to the ReLU phase, a convolutional layer includes a convolution and a max-pooling phase. Similarly to the weighted-sum function, a convolution is a linear operation on the input of the layer and can be encoded by the following:

















Conv_{u,v}^{(i),j} = K_j^{(i)} · x^{(i−1)}_{[u,u′],[v,v′]} + b_j^{(i)}, where u′ = u+p−1 and v′ = v+q−1   (C15)







A max-pooling function is parameterised over the size of the groups of pixels over which the max-pooling is performed. Previous linear encodings of the function use a binary variable per node in a group; here, an encoding is provided that uses logarithmically fewer variables. Specifically, to select the maximum value from a group, a sequence of binary variables is introduced. The number in base 2 represented by the binary sequence expresses the node in a group whose value is maximum. Since the size of the group is p·q, ⌈log2(p·q)⌉ binary variables are needed to represent the node whose value is maximum. To facilitate the presentation of the corresponding linear constraints, we write ⟨n⟩ for the binary representation of a non-negative integer n. We denote by |⟨n⟩| the number of binary digits in ⟨n⟩. Given j, ⟨n⟩_j expresses the j-th digit in ⟨n⟩, whereby the first digit is the least significant bit. If j>|⟨n⟩|, then we assume that ⟨n⟩_j=0. The linear representation of the max-pooling function for a pixel px=(px_α, px_β, px_γ) and pool size p·q is given by the following.





Pool_{px}^{(i)} ≥ Conv^{(i)}_{(px_α−1)p+u′, (px_β−1)q+v′, px_γ},  u′∈[p], v′∈[q]  (C16)

Pool_{px}^{(i)} ≤ Conv^{(i)}_{(px_α−1)p+u′, (px_β−1)q+v′, px_γ} + M·Σ_{j∈[|⟨p·q⟩|]} (⟨z⟩_j + (1−2⟨z⟩_j)·δ_{px,j}^{(i)}),  u′∈[p], v′∈[q], z=(u′−1)q+v′−1  (C17)

where δ_{px,1}^{(i)}, . . . , δ_{px,⌈log2(p·q)⌉}^{(i)} are the binary variables associated with px.


For the case where p·q is not a power of 2, it is required that the number represented by δ_{px,1}^{(i)}, . . . , δ_{px,⌈log2(p·q)⌉}^{(i)} lie within 0, . . . , p·q−1, which is formally expressed through the following constraint:

Σ_{j∈[|⟨p·q⟩|]} (⟨z⟩_j + (1−2⟨z⟩_j)·δ_{px,j}^{(i)}) ≥ 1,  z ∈ [p·q, 2^{|⟨p·q⟩|}−1]  (C18)

Thus, a convolutional layer ƒ^{(i)} is described by its MILP representation C^{(i)} = C15∪ . . . ∪C18.
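
A gurobipy sketch of this logarithmic max-pooling encoding for a single pooling group follows; the bound M and the function name are illustrative, and M is assumed to bound the activations in the group:

```python
import math
import gurobipy as gp
from gurobipy import GRB

def encode_maxpool(m, elems, M=1e3):
    """Logarithmic encoding of max-pooling (in the spirit of C16-C18).

    elems: MILP variables of one pooling group (size p*q). Only
    ceil(log2(p*q)) binary variables are used; their value in base 2 selects
    the element attaining the maximum.
    """
    N = len(elems)
    B = max(1, math.ceil(math.log2(N)))
    bits = [m.addVar(vtype=GRB.BINARY) for _ in range(B)]
    pool = m.addVar(lb=-GRB.INFINITY, name="pool")

    def mismatch(z):
        # Hamming distance between the binaries and the code of position z:
        # zero exactly when the binaries encode z.
        return gp.quicksum(((z >> j) & 1) + (1 - 2 * ((z >> j) & 1)) * bits[j]
                           for j in range(B))

    for z in range(N):
        m.addConstr(pool >= elems[z])                      # C16
        m.addConstr(pool <= elems[z] + M * mismatch(z))    # C17
    for z in range(N, 2 ** B):                             # C18: exclude unused codes
        m.addConstr(mismatch(z) >= 1)
    return pool

# Usage: a 2x2 pooling group of bounded MILP variables.
m = gp.Model("pool")
group = [m.addVar(lb=-1, ub=1) for _ in range(4)]
pool = encode_maxpool(m, group)
```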


Given the above, the set of constraints describing a transformational CNN CNN_t(x) = ƒ^{(n)}(ƒ^{(n−1)}( . . . ƒ^{(1)}(x) . . . )) is obtained as the union of the constraints characterising its layers, that is, C(CNN_t) = C^{(1)}∪ . . . ∪C^{(n)}.


At step 408, the processor 102 is given a labelled input I to the CNN, with label l, at which local transformational robustness to the transformation is to be evaluated. For example, this may be an input whose true class is known and at which the CNN may be expected to be robust to the transformation.


At step 410, the local transformational robustness requirement at the input is encoded into a set of equality and inequality constraints.


First, equality and inequality constraints are generated that specify that the input to the transformational CNN is set to the given labelled input I. In the present example, the equality constraint used is C(I), defined as x^{(0)} = I, which fixes the input of t to I.


Second, equality and inequality constraints are generated that specify that there is a linear activation in the output layer that is larger than the activation associated with the label l. This is achieved similarly to the encoding of the max-pooling function: a sequence of ⌈log2 c⌉ binary variables δ_1^{(n)}, . . . , δ_{⌈log2 c⌉}^{(n)} is introduced, where the sequence's binary number b denotes the node from the output layer that is associated with class b+1∈[c] and whose linear activation is larger than the linear activation of the node associated with l. The constraint is then expressed as









WS(x^{(n−1)})_l ≤ WS(x^{(n−1)})_j + M·Σ_{k∈[|⟨c⟩|]} (⟨j⟩_k + (1−2⟨j⟩_k)·δ_k^{(n)}),  j ∈ [0, c−1] \ {l−1}   (C19)







Moreover, the variables δ_1^{(n)}, . . . , δ_{⌈log2 c⌉}^{(n)} are prevented from representing l (that is, the code l−1) or any number greater than c−1 using the following constraint:





Σ_{k∈[|⟨c⟩|]} (⟨j⟩_k + (1−2⟨j⟩_k)·δ_k^{(n)}) ≥ 1,  j ∈ {l−1} ∪ [c, 2^{|⟨c⟩|}−1]  (C20)


The requirement of local robustness is thus described by the linear inequality constraints C(lrob)=C19∪C20.


At step 412, all the constraints obtained at steps 406 and 410 are aggregated into a set of constraints C(all) = C(I)∪C(CNN_t)∪C(lrob). Recall that the constraints C(I) specify that the input to the CNN is the given labelled input image, the constraints C(CNN_t) specify the CNN and the transformation over its domain, and the constraints C(lrob) specify that there is a linear activation in the output layer that is larger than the activation associated with the label of the labelled input. Therefore, if a solution to the set of constraints C(all) exists, the CNN is not locally transformationally robust to the transformation at the labelled image. Conversely, if the set of constraints does not admit a solution, the CNN is locally transformationally robust. This is proven in the following theorem:


Theorem 1: Let CNN be a CNN, t a transformation with domain D, and I an image. Let LP be the linear problem defined on the objective function obj=0 and the set of constraints C(I)∪C(CNN_t)∪C(lrob), where C(I), defined as x^{(0)} = I, fixes the input of t to I. Then CNN is locally transformationally robust for t and I if and only if LP has no solution.


Proof: Let x^{(n+1)} = ƒ^{(n+1)}( . . . ƒ^{(1)}(I) . . . ). For the left to right direction assume that LP has a feasible solution. Consider d=(λ_j: j∈dof(t)), where each λ_j is the value of the d.o.f. d_j of t in the solution. By the definition of LP we have that CNN(t[d](I)) = Argmax(x^{(n+1)}). By the definition of C(lrob) there is l′ with l′≠l and

x_{l′}^{(n+1)} ≥ x_l^{(n+1)}.

Therefore CNN(t[d](I))≠l, and therefore CNN is not locally transformationally robust. For the right to left direction suppose that CNN is not locally transformationally robust. It follows that there is d∈D such that CNN(t[d](I))≠l. Then the assignment λ_j=d_j for each d.o.f. j of t, together with the values x^{(i)} = ƒ^{(i)}( . . . ƒ^{(1)}(I) . . . ) for each layer i, is a feasible solution for LP.


At step 414, the processor 102 determines whether the constraints C(all) admit a solution. This may be done using any suitable optimisation solver, such as a simplex-method or interior-point solver in the case of linear constraints, or a convex optimisation solver if appropriate.


In the present example, each constraint of C(all) is a linear equality or inequality constraint on real-valued and integer variables. A set of constraints where each constraint is a linear equality or inequality constraint on real-valued and integer variables is said to be a mixed-integer linear problem (MILP); in the present example, C(all) is thus a mixed-integer linear problem (MILP). There exist dedicated programs that are able to ascertain whether a MILP admits a solution, and return such a solution if it exists. For example, the Gurobi MILP solver is such a program.


In the present example step 414 is carried out by the Gurobi MILP solver determining whether the mixed-integer linear problem C(all) admits a solution.
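
A minimal sketch of this step, assuming the constraints C(all) have already been added to a gurobipy model, is given below; the function name and the time-limit value are illustrative:

```python
import gurobipy as gp
from gurobipy import GRB

def check_robustness(model: gp.Model, time_limit_s: float = 200.0) -> str:
    """Decide the step-414 query for a model holding the constraints C(all).

    Returns 'robust' when C(all) is infeasible, 'counterexample' when a feasible
    assignment (and hence transformation parameters d) was found, and 'unknown'
    if the solver hit the time limit without an answer.
    """
    model.Params.TimeLimit = time_limit_s
    model.setObjective(0, GRB.MINIMIZE)   # obj = 0: feasibility is all that matters
    model.optimize()
    if model.Status == GRB.INFEASIBLE:
        return "robust"
    if model.SolCount > 0:
        return "counterexample"           # read the lambda variables off the model
    return "unknown"
```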


If at step 416, it is found that no solution to the constraints C(all) exists, the method therefore determines that the CNN is locally transformationally robust at step 418. As a result, the safety of the CNN under a range of practically relevant conditions may be established.


If, on the other hand, one or more solutions are found that fulfil the constraints C(all), the method moves to step 420.


At step 420, one or more adversarial examples are generated from the one or more solutions that fulfil the constraints C(all). An adversarial example is an input obtained by transforming the labelled input using the transformation, which is classified differently than the labelled image by the CNN.


The one or more adversarial examples are generated as follows. Each of the one or more solutions describes a value for the transformation's degrees of freedom d, such that applying the transformation with the degrees of freedom set to d to the labelled image results in an image which is classified differently than the labelled image by the CNN.


Therefore, for each of the one or more solutions to the constraints C(all), the values of the degrees of freedom d specified in the solution may be obtained, and an adversarial example may be generated as I_adv = t[d](I).


Thus, the method guarantees the generation of adversarial examples whenever they exist, in contrast to previous formal verification approaches in adversarial learning, where adversarial examples may not be identified even if these exist.


The method then advances to step 422, where the one or more adversarial examples are used as training examples to further train the CNN. As a result of training the CNN using the adversarial examples, the CNN may learn to classify the adversarial examples correctly. Consequently, the robustness of the CNN may be improved, so that the CNN may be made more suitable for a practical deployment where distortions represented by the transformation are to be expected.


Once the one or more adversarial examples have been used as training examples to further train the CNN, the method 400 may be performed repeatedly on the further trained CNN to improve the robustness of the CNN. For example, method 400 may be repeatedly performed until the CNN is shown to be locally transformationally robust.
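
The overall augmentation loop of steps 420-422 might be organised as in the following sketch; train_fn, verify_fn and transform are placeholder callables standing in for the training and verification machinery described above, not a disclosed API:

```python
import numpy as np

def counterexample_guided_training(train_fn, verify_fn, transform, X, y, max_rounds=3):
    """Verification-based data augmentation loop (a sketch of steps 420-422).

    train_fn(X, y)             -> trained classifier (e.g. a Keras model wrapper)
    verify_fn(model, x, label) -> None if robust at x, else transformation params d
    transform(x, d)            -> the transformed (adversarial) input t[d](x)
    """
    X_aug, y_aug = list(X), list(y)
    for _ in range(max_rounds):
        model = train_fn(np.array(X_aug), np.array(y_aug))
        new_examples = []
        for x, label in zip(X, y):
            d = verify_fn(model, x, label)
            if d is not None:                       # counterexample found
                new_examples.append((transform(x, d), label))
        if not new_examples:
            return model                            # shown locally robust
        X_aug += [e for e, _ in new_examples]       # augment the training set
        y_aug += [lbl for _, lbl in new_examples]
    return model
```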


With reference to FIG. 5, experimental results are now described. Method 400 was implemented by a toolkit VERA that takes as input the descriptions of a CNN and a sequence of transformations against which the local transformational robustness of the CNN is meant to be assessed. Following this, VERA builds the linear encoding of the verification query according to steps 402-412.


Having constructed the linear program, VERA invokes the Gurobi checker to ascertain whether the program admits a solution. A satisfiability outcome of the latter corresponds to a violation of the local transformational robustness property of the CNN, whereas an unsatisfiability outcome can be used to assert that the CNN is locally transformationally robust.


VERA has been tested on CNNs trained on the MNIST dataset using the deep learning toolkit Keras. Since currently there are no other methods or tools for the same problem, only the results obtained with VERA are reported. In the experiments, a CNN of 1481 nodes was used, with a convolutional layer of 3 convolutions with kernels of size 15×15 and pool-size 2×2, and an output layer with 10 nodes. The accuracy of the network on the MNIST dataset is 93%. To check the network's local transformational robustness, 100 images were selected for which the network outputs the correct classification label. Experiments were then performed for translation, subsampling, zooming and photometric transformations with varying domains for each of their degrees of freedom, with results summarised in FIG. 5. The experiments were run on a machine equipped with an i7-7700 processor and 16 GB of RAM and running Linux kernel version 4.15.0.



FIG. 5 reports the number of images verified within the timeout of 200 s, irrespective of whether these were shown robust, followed by the average time taken for verifying said images. This is indicated in the column #√(s). For example, for subsampling transformations on domain [2,3], the method could verify 97 images out of 100 with an average time of 2 secs. Furthermore, the LTR column reports the number of images that were determined to be locally transformationally robust.


Note that there is some variability in the results. For example, several images could not be assessed within the timeout for the translation with domain [−1, 1], but many more could be analysed under the translation domain [−3, 3]. This is likely to be due to optimisations carried out by Gurobi which can be applied in some cases only. Indeed, note in general that an increase in the range of the domains does not lead to longer computation times, since the resulting linear program is only marginally extended.


In summary, the results show that the CNN built from the MNIST dataset is not locally transformationally robust with respect to translation, subsampling and zooming, returning different classifications even for small transformational changes to the input. The CNN appears just as fragile in terms of luminosity and contrast changes. Overall, the results show that the CNN in question is brittle with respect to transformational robustness.


Furthermore, with reference to FIG. 6, the efficacy of method 400 to improve the robustness of the CNN via data augmentation was experimentally evaluated for translation, scale, shear, rotation, brightness, and contrast. The results were then compared with results obtained from traditional data augmentation schemes. For each of the transformations, the experimental plan consisted of the following three steps.


In the first step, a transformed test set was generated by applying the transformation to each of the images of the original test set. For each of the transformed images a random instantiation of the transformation that was uniformly sampled from its domain was used. The second column of FIG. 6 records the accuracy of the CNN on the transformed test set. The CNN exhibits very poor generalisation to affine transformations and good generalisation to photometric transformations.


In the second step, twenty correctly classified images were sampled from the original training set. These were passed to VERA to generate the augmentation set. Then, the training set was enlarged with the augmentation set and the network was retrained.


In the third step, the first and second steps were performed again by using standard augmentation methods whereby random instantiations of the transformation are applied to images from the original training set.


The three steps were repeated for three iterations. For each iteration, FIG. 6 reports the size of the augmentation set w.r.t. the original training set (first column), and the accuracy of the resulting model and the average time each augmentation method took per image (column 2 for standard augmentation and column 3 for verification-based augmentation). For example, for translation, standard augmentation achieves 64% accuracy with 0 s average time, and verification-based augmentation achieves 65% accuracy with 161 s average time.


The results show that verification-based augmentation achieves higher accuracy than standard augmentation methods.


The observed variability in the results is accounted for by the varying sizes of the augmentation sets in conjunction with the different accuracies exhibited by the models in each of the iterations. As the augmentation set grows and the accuracy of the classifier improves, the enlargement of the training set with counterexamples is more beneficial to the improvement of the classifier's accuracy than its enlargement with random transformations.


Variations and modifications of the specific embodiments described above will be apparent to the skilled person. For example, alternative forms of classifier neural network may be adopted as appropriate. In general, a perception classifier may classify sensor data. Similarly, while the system of FIG. 1 is illustrated in a particular form, the skilled person will recognise that alternative hardware and/or software elements may be adopted to carry out the method described in the present disclosure.

Claims
  • 1. A computer-implemented method for verifying the robustness of a neural network classifier with respect to one or more parameterised transformations applied to an input, the classifier comprising one or more convolutional layers, the method comprising: encoding each layer of the classifier as one or more algebraic classifier constraints; encoding each transformation as one or more algebraic transformation constraints; encoding a change in an output classifier label from the classifier as an algebraic output constraint; determining whether a solution exists which satisfies the classifier constraints, transformation constraints and output constraints; and determining the classifier as robust to the local transformations if no such solution exists.
  • 2. A method according to claim 1, further comprising, where a solution exists which satisfies the classifier constraints, transformation constraints and output constraints: identifying parameters of the one or more transformations associated with the solution.
  • 3. A method according to claim 2, comprising: generating additional training data in dependence on the identified parameters; training the classifier using the training data.
  • 4. A method according to claim 3, wherein generating the additional training data comprises applying the one or more transformations to existing training data using the identified parameters.
  • 5. A method according to claim 1, wherein one or more of the classifier, transformation and output constraints are linear constraints.
  • 6. A method according to claim 1, wherein one or more of the classifier, transformation and output constraints are non-linear constraints.
  • 7. A method according to claim 1, wherein at least one of the one or more local transformations is a geometric transformation.
  • 8. A method according to claim 1, wherein at least one of the one or more local transformations is a photometric transformation.
  • 9. A method according to claim 1, wherein at least one of the one or more local transformations is an affine transformation.
  • 10. A method according to claim 1, wherein the classifier further comprises one or more fully connected layers.
  • 11. A method according to claim 1, wherein encoding each layer of the classifier as one or more algebraic classifier constraints comprises deriving a mixed-integer linear programming expression for each layer.
  • 12. A method according to claim 1, wherein encoding each transformation as one or more algebraic transformation constraints comprises deriving a mixed-integer linear programming expression for each transformation.
  • 13. A method according to claim 1, wherein encoding each transformation as one or more algebraic transformation constraints comprises deriving a mixed-integer non-linear programming expression for each transformation.
  • 14. A method according to claim 1, wherein one or more of the classifier layers comprises a rectified linear unit activation function.
  • 15. A method according to claim 1, wherein the classifier is an image classifier.
  • 16. A method according to claim 1, wherein the classifier is an audio classifier.
  • 17. A computer program product comprising computer executable instructions which, when executed by one or more processors, cause the one or more processors to carry out a method for verifying the robustness of a neural network classifier with respect to one or more parameterised transformations applied to an input, the classifier including one or more convolutional layers, the method including: encoding each layer of the classifier as one or more algebraic classifier constraints; encoding each transformation as one or more algebraic transformation constraints; encoding a change in an output classifier label from the classifier as an algebraic output constraint; determining whether a solution exists which satisfies the classifier constraints, transformation constraints and output constraints; and determining the classifier as robust to the local transformations if no such solution exists.
  • 18. A perception system comprising one or more processors configured to carry out a method for verifying the robustness of a neural network classifier with respect to one or more parameterised transformations applied to an input, the classifier including one or more convolutional layers, the method including: encoding each layer of the classifier as one or more algebraic classifier constraints; encoding each transformation as one or more algebraic transformation constraints; encoding a change in an output classifier label from the classifier as an algebraic output constraint; determining whether a solution exists which satisfies the classifier constraints, transformation constraints and output constraints; and determining the classifier as robust to the local transformations if no such solution exists.
Priority Claims (1)
Number Date Country Kind
1819211.2 Nov 2018 GB national
PCT Information
Filing Document Filing Date Country Kind
PCT/GB2019/053335 11/26/2019 WO 00