The present disclosure relates generally to the field of computer vision technology. More specifically, the present disclosure relates to computer vision systems and methods for optimized computer vision using deep neural networks and Lipschitz analysis.
Convolutional neural networks (“CNNs”) are widely used in machine learning and are an effective tool in various image processing tasks, such as classification of objects. In particular, CNNs can be used as feature extractors to extract different details from images to identify objects in the images. As feature extractors, CNNs are stable with respect to small variations in the input data and, therefore, perform well in a variety of classification, detection, and segmentation problems. As such, similar features are expected when inputs are from the same class.
The deformation stability of certain CNNs can be attributed to sets of filters that form semi-discrete frames having an upper frame bound equal to one. This deformation stability is a consequence of the Lipschitz property of the CNN or of its feature extractor. As such, the upper bound can be referred to as a Lipschitz bound.
However, current CNNs can be fooled by changing a small number of pixels, thus leading to an incorrect classification. This can be the result of an instability of the CNN due to a large Lipschitz bound or a lack of one. Therefore, there is a need for computer vision systems and methods which can determine the Lipschitz bound for different types of CNNs, thereby improving the ability of computer vision systems to tolerate variations in input data. These and other needs are addressed by the computer vision systems and methods of the present disclosure.
The present disclosure relates to computer vision systems and methods for optimized computer vision using deep neural networks and Lipschitz analysis. A neural network, such as a CNN, is a multiple-layer network with learnable weights and biases that can be used for, among other things, analyzing visual imagery. The system of the present disclosure receives signals or data related to the visual imagery, such as data from a camera, and feeds the signals/data forward through the multiple layers of the CNN. At one or more layers of the CNN, the system determines at least one Bessel bound of that layer. The system then determines a Lipschitz bound based on the one or more Bessel bounds and applies the Lipschitz bound to the signals. Once the Lipschitz bound is applied, the system can feed-forward the signals to other processes of the layer or to a further layer.
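To make this flow concrete, the following is a minimal, hypothetical sketch in Python. The function names, the ReLU detection step, and the norm-rescaling rule used to apply the bound are illustrative assumptions, not the disclosure's prescribed operations:

```python
import numpy as np

def apply_bound(signal: np.ndarray, bound: float) -> np.ndarray:
    # One simple way to impose a Lipschitz bound on a signal (assumed
    # here for illustration): rescale so its norm cannot exceed the bound.
    norm = np.linalg.norm(signal)
    return signal if norm <= bound else signal * (bound / norm)

def feed_forward(signal: np.ndarray, layers) -> np.ndarray:
    # Each layer is modeled as (filter, lipschitz_bound); the filters and
    # bound values used below are toy placeholders.
    for conv_filter, lipschitz_bound in layers:
        signal = np.maximum(conv_filter(signal), 0.0)  # filter + pointwise ReLU
        signal = apply_bound(signal, lipschitz_bound)
    return signal

layers = [(lambda y: 2.0 * y, 3.0), (lambda y: y + 1.0, 5.0)]
print(feed_forward(np.array([1.0, -2.0, 0.5]), layers))  # [3. 1. 2.]
```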
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present disclosure relates to computer vision systems and methods for optimized computer vision using deep neural networks and Lipschitz analysis, as described in detail below in connection with the accompanying drawings.
By way of background, and before describing the systems and methods of the present disclosure in detail, the structure and properties of convolutional neural networks (“CNNs”) will be discussed first. It should be noted that the CNNs discussed below relate to a generic CNN. However, those skilled in the art would understand that the methods and exemplary embodiments in this disclosure can pertain to any CNN, including, but not limited to, scattering CNNs, fully connected CNNs, sparsely connected CNNs, etc.
A CNN can contain multiple layers, where each layer can include different or similar filters and operations.
The input node 102 can process one or more signals and/or data, such as image data (e.g., pixels) or audio data. The input node 102 can be derived from an output node of a previous layer of the CNN or, when a layer of the CNN is a first layer, the input node 102 can be an initial input or signal. For example, a camera can be positioned to record traffic patterns in a particular area. The data from the camera can be fed into the CNN, where each image can be converted into an input node 102 that is fed into a first layer of the CNN. The first layer can then apply its filters and operations to the input signal of the input node 102 and produce an output node 112, which can then be fed into a second layer of the CNN.
The convolution filter 104 is a filter that can apply a convolution operation to the signal from the input node 102. For example, the operation can apply one or more convolution filters to different sections of the input signal. The result of the operation produces an output signal that is fed-forward to a next filter or process in the next layer of the CNN, or the result can be transformed into an output node. It should be noted that the input signal can be described with the symbol (“y”) and the convolution filter 104 can be described with the symbol (“g”) throughout the figures and description of the present disclosure.
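As a minimal sketch (assuming a single-channel image and a single filter; the kernel values below are illustrative), the convolution of an input signal y with a filter g can be written in Python as:

```python
import numpy as np
from scipy.signal import convolve2d

# Input signal y (a small single-channel image) and convolution filter g.
y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)
g = np.array([[0, 1, 0],
              [1, -4, 1],
              [0, 1, 0]], dtype=float)  # illustrative Laplacian-style kernel

# The output signal that would be fed-forward to the next operation.
output = convolve2d(y, g, mode="same")
print(output)
```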
The dilation operation 106 is an operation that can dilate an element of the output signal and/or data by a predetermined factor. For example, if the signal and/or data is represented in a 3×3 matrix, the dilation operation can dilate the 3×3 matrix into a 7×7 matrix. It should be noted that the dilation operation 106 can be described with the symbol (“↓D”) throughout the figures and description of the present disclosure.
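The disclosure does not spell out the exact dilation rule; one common construction, assumed here purely for illustration, inserts D−1 zeros between neighboring entries, so a 3×3 matrix with dilation factor D = 3 becomes a 7×7 matrix:

```python
import numpy as np

def dilate(matrix: np.ndarray, D: int) -> np.ndarray:
    # Insert D-1 zeros between neighboring entries (one reading of the
    # dilation operation "↓D"; assumed here for illustration).
    n, m = matrix.shape
    out = np.zeros((D * (n - 1) + 1, D * (m - 1) + 1))
    out[::D, ::D] = matrix
    return out

y = np.arange(1, 10, dtype=float).reshape(3, 3)
print(dilate(y, 3).shape)  # (7, 7): the 3x3 matrix dilated into a 7x7 matrix
```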
The detection operation 108 is a nonlinear operation that is applied pointwise to the output signal from the convolution filter 104. For example, the nonlinear operation can be a Lipschitz-continuous function, such as a rectified linear unit (“ReLU”), a sigmoid, etc. The application of the nonlinear operation can improve the robustness of the CNN by preventing instability. Specifically, the nonlinear operation can prevent values from the input signals from uncontrollably increasing and becoming unmanageable when the values are processed through, for example, the merge filter 110 and the pooling filter 114. It should be noted that the detection operation 108 can be described with the symbol (“σ”).
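For example, ReLU and the sigmoid can be applied pointwise as below; ReLU is 1-Lipschitz and the sigmoid is (1/4)-Lipschitz, so neither amplifies small input perturbations:

```python
import numpy as np

# Pointwise detection operation sigma: every entry of the output signal
# is passed through the nonlinearity.
relu = lambda y: np.maximum(y, 0.0)           # 1-Lipschitz
sigmoid = lambda y: 1.0 / (1.0 + np.exp(-y))  # (1/4)-Lipschitz

y = np.array([[-2.0, 0.5], [3.0, -0.1]])
print(relu(y))     # negative entries clipped to 0
print(sigmoid(y))  # entries squashed into (0, 1)
```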
To optimize the performance of computer vision systems which rely on CNNs, the present disclosure determines an optimal Lipschitz bound for the CNN. This determination can occur during the detection operation 108. By determining the optimal Lipschitz bound, there can be a significant reduction in identification errors, which would enable the CNN to conduct a more accurate analysis. Thus, the methods and embodiments discussed herein produce a significant improvement in the functioning of computer vision systems. Processes for determining the optimal Lipschitz bound are discussed in more detail below.
The merge filter 110 is a filter that merges two or more outputs from the detection operation 108 by a pointwise operation to produce a single output.
The first example is pointwise addition merging 202, which can, for example, take input signals y1, y2, . . . , yk from a filter, apply a nonlinearity function σ1, σ2, . . . , σk, respectively, and add the results pointwise. The output of the pointwise addition merging 202 can be defined by the formula:
$z = \sum_{j=1}^{k} \sigma_j(y_j)$
The second example is p-norm aggregation 204, which can, for example, take input signals y1, y2, . . . , yk from a filter, apply a nonlinearity function σ1, σ2, . . . , σk, respectively, and apply a pointwise p-norm aggregation. The output of the p-norm aggregation 204 can be defined by the formula:
$z = \left( \sum_{j=1}^{k} \left| \sigma_j(y_j) \right|^p \right)^{1/p}$
The third example is pointwise multiplication merging 206, which can, for example, take input signals y1, y2, . . . , yk from a filter, apply a nonlinearity function σ1, σ2, . . . , σk, respectively, and apply a pointwise multiplication. The output of the pointwise multiplication merging 206 can be defined by the formula:
$z = \prod_{j=1}^{k} \sigma_j(y_j)$
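A minimal NumPy sketch of the three merging operations, assuming k input signals of equal shape and, purely for illustration, ReLU for every nonlinearity σj:

```python
import numpy as np

sigma = lambda y: np.maximum(y, 0.0)  # illustrative choice: ReLU for every sigma_j

def merge_add(signals):         # pointwise addition merging
    return sum(sigma(y) for y in signals)

def merge_pnorm(signals, p=2):  # p-norm aggregation
    return sum(np.abs(sigma(y)) ** p for y in signals) ** (1.0 / p)

def merge_mul(signals):         # pointwise multiplication merging
    return np.prod([sigma(y) for y in signals], axis=0)

signals = [np.array([1.0, -2.0, 3.0]), np.array([0.5, 1.0, -1.0])]
print(merge_add(signals), merge_pnorm(signals), merge_mul(signals))
```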
It should be noted that the merge filter 110 can use any of the merging operations discussed above, or any other suitable pointwise merging operation.
The output node 112 carries the output signal produced by the filters and operations of the layer. As discussed above, the output node 112 can be fed as an input node into a next layer of the CNN.
Returning to the structure of the layer, the pooling filter 114 is a filter that aggregates values of an input signal over regions to produce a reduced output. A first example of the pooling filter 114 is max pooling, which retains the maximum value within each pooling region.
A second example of the pooling filter 114 is average pooling, which computes the average of the values within each pooling region.
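A minimal sketch of both pooling operations over non-overlapping regions (the region size and input values below are illustrative):

```python
import numpy as np

def pool(y: np.ndarray, size: int, op=np.max) -> np.ndarray:
    # Apply op (np.max for max pooling, np.mean for average pooling)
    # over non-overlapping size x size regions of the signal.
    n, m = y.shape[0] // size, y.shape[1] // size
    blocks = y[:n * size, :m * size].reshape(n, size, m, size)
    return op(blocks, axis=(1, 3))

y = np.arange(16, dtype=float).reshape(4, 4)
print(pool(y, 2, np.max))   # [[ 5.  7.] [13. 15.]]
print(pool(y, 2, np.mean))  # [[ 2.5  4.5] [10.5 12.5]]
```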
It should be understood that the structure of the layer discussed above is only exemplary, and that a layer of a CNN can include different combinations of the filters and operations discussed above.
It should further be noted that the merging operations of the merge filter 110 and the pooling operations of the pooling filter 114 aggregate input signals from the input nodes 102, filters, and/or other signals. As discussed above, a nonlinear operation, such as an operation which imposes a Lipschitz bound, can prevent the values from the input signals from uncontrollably increasing and becoming unmanageable when the values are processed through the merge filter 110 and the pooling filter 114. Additionally, a Lipschitz bound can be imposed on a signal, after which the signal can proceed to a next layer.
In step 602, a first layer of the CNN receives a first input node. The first input node includes a first input signal. For example, the input signal can be a matrix representative of an image. In step 604, the input signal can pass through a first filter. The first filter can be the convolution filter 104. The convolution filter 104 applies the convolution operation to the input signal. The result of the operation produces an output signal that is fed-forward to the detection operation 108.
In step 606, the detection operation 108 receives the output signal and determines at least one type of Bessel bound for the first layer. Two sets of formulas for determining the different types of Bessel bounds will be discussed. The first set of Bessel bound formulas, seen below, can be used to determine three types of Bessel bounds if the first layer does not contain the merge filter 110:
The second set of Bessel bound formulas, seen below, can be used to determine three types of Bessel bounds if the first layer contains the merge filter 110:
It should be understood by those skilled in the art that the number and types of Bessel bounds determined in step 606, as well as the formulas for determining each type of Bessel bound, can differ depending on the filters and operations in a layer (e.g., the first layer). Thus, the two sets of formulas used in this disclosure to determine three types of Bessel bounds, based on whether the first layer contains the merge filter 110, are only exemplary.
In step 608, the detection operation 108 can determine the Lipschitz bound for the first layer. The Lipschitz bound can be determined based on the type of the Bessel bound(s) determined in step 606. Bessel bounds $B_m^{(1)}$, $B_m^{(2)}$, and $B_m^{(3)}$ can be determined in step 606 and, in step 608, the Lipschitz bound can be determined by the following first Lipschitz calculation:
As discussed above, the value “z” relates to the output value of the merge operation used by the merge filter 110.
The Lipschitz bound can also be determined in step 608 by the following second Lipschitz calculation, which uses only the Bessel bound $B_m^{(1)}$:
Additionally, the Lipschitz bound can also be determined in step 608 by the following third Lipschitz calculation, which uses only $B_m^{(2)}$ and $B_m^{(3)}$:
In step 610, the determined Lipschitz bound is applied to the output signal. It should be understood that, in step 608, the first Lipschitz calculation, the second Lipschitz calculation, and the third Lipschitz calculation can produce different Lipschitz bound values. As such, in step 610, the Lipschitz bound value that is closest to optimality can be selected, as illustrated in the sketch below. Alternatively, a different Lipschitz bound value can be selected based on a predetermined parameter. In step 612, the output signal is fed-forward to the next process or filter in the first layer or to a next layer. For example, the output signal can be fed-forward to the merge filter 110 or the pooling filter 114; alternatively, the output signal can be fed-forward to a next layer of the CNN.
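Whatever values the underlying calculations produce, the selection in step 610 reduces to keeping the smallest candidate bound. A minimal sketch, using the three candidate values from the worked example discussed below:

```python
# Candidate Lipschitz bounds from the three calculations (values taken
# from the worked example below); the smallest is closest to optimality.
candidates = {
    "first Lipschitz calculation": 2.866,
    "second Lipschitz calculation": 4.102,
    "third Lipschitz calculation": 5.0,
}
selected = min(candidates, key=candidates.get)
print(selected, candidates[selected])  # first Lipschitz calculation 2.866
```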
A function on the Fourier domain, supported on (−1, 1), is defined as:
The Fourier transforms of the filters are defined to be C∞ gate functions:
Applying the first Lipschitz calculation to the determined Bessel bounds in Table 1 produces a Lipschitz bound of 2.866. Applying the second Lipschitz calculation to the determined Bessel bounds in Table 1 produces a Lipschitz bound of 4.102. Applying the third Lipschitz calculation to the determined Bessel bounds in Table 1 produces a Lipschitz bound of 5. As such, the Lipschitz bound value of 2.866, as determined by the first Lipschitz calculation, is the closest to optimality. Thus, for example, the Lipschitz bound value of 2.866 can be selected in step 610, discussed above.
The functionality provided by the present disclosure could be provided by computer vision software code 806, which could be embodied as computer-readable program code stored on the storage device 804 and executed by the CPU 812 using any suitable high-level or low-level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 808 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 802 to communicate via the network. The CPU 812 could include any suitable single-core or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the computer vision software code 806 (e.g., an Intel processor). The random access memory 814 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is intended to be protected by Letters Patent is set forth in the following claims.
The present application claims the priority of U.S. Provisional Application Ser. No. 62/685,460 filed on Jun. 15, 2018, the entire disclosure of which is expressly incorporated by reference.