Embodiments of the present invention relate to systems and methods for classification, including learning representations from data for use in classification systems that are robust to unknown inputs. More particularly, embodiments relate to systems and methods in which inputs from unknown classes of data are represented in such a manner that the system improves the separation between inputs from multiple known classes and inputs from unknown classes, and to methods for improving such separation by machine learning using a class of robust loss functions. In other words, the invention improves multi-class recognition systems by providing robustness to inputs that come from classes other than those the systems were designed to handle or trained on.
There are many systems designed to detect or recognize a wide range of objects. Such systems are developed around a set of classes of interest. However, when used in a general setting, there is a significant probability that such systems will have to process data from other unknown classes, e.g., a visual recognition system may see new objects, a system analyzing human behavior will see novel behaviors, a medical diagnostic system is presented with new diseases, and a security system will see new attacks. In a system that detects or recognizes objects, the ability to robustly handle such unknown data is critical. This invention addresses how to improve the ability to detect or recognize correct classes while reducing the impact of unknown inputs.
In order to formalize the discussion and better understand the problem, let the infinite label space of all classes be broadly categorized into: the known classes of interest c′, the known unknown (background) classes b′, and the unknown unknown classes u.
Classification and recognition systems have a long history with many inventions. Ever since a convolutional neural network (CNN) won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the extraordinary increase in the performance of deep learning architectures has contributed to their growing application across many domains. Interestingly, though new state-of-the-art algorithms emerge from these domains each year, a crucial component of their architecture remains largely unchanged: the handling of unwanted or unknown inputs.
For traditional learning systems, learning with rejection or background classes has been around for decades; for example, see Chi-Keung Chow: “An optimum character recognition system using decision functions.” IRE Transactions on Electronic Computers, (4):247-254, 1957, and C. K. Chow: “On optimum recognition error and reject tradeoff,” IEEE Transactions on Information Theory, IT-16, no. 1, pp. 41-46, 1970. These works and their many extensions assume exact knowledge of the class statistics and/or probability distributions.
More recent inventions have expanded on these ideas. For example, U.S. Pat. No. 6,438,519, issued Aug. 20, 2002, to William Michael Campbell and Charles Conway Broun, is entitled “Apparatus and method for rejecting out-of-class inputs for pattern classification.” The #519 patent teaches an approach that simply thresholds the ranking of classification scores, a slight variation of the original Chow approach of thresholding the score itself. The threshold is determined from a ranking tolerance. The approach presumes the feature representation and the classifiers are both fixed, and the patent teaches only how to select among outputs.
The formulation of U.S. Pat. No. 6,690,829, issued Feb. 10, 2004, to Ulrich Kressel, Frank Lindner, and Christian Wohler, entitled “Classification system with reject class,” offers a more general model that includes items from both known inputs c′ and undesired inputs b′. They also try to reject unknowns from other classes, using a rejection threshold that is determined using the inputs from c′ and b′. The approach presumes the feature representation and the classifiers are both fixed. The patent teaches only how to select a threshold on classification scores to reject unknown inputs from u.
More recently, U.S. Pat. No. 10,133,988, issued Nov. 20, 2018, to Pedro Ribeiro Mendes Júnior, et al., entitled “Method for multiclass classification in open-set scenarios and uses thereof,” addresses the problem of rejection of unknown inputs in multiclass classification. That patent teaches optimizing parameters using a combination of samples from c′ and b′ to determine an optimal ratio threshold, and then using ratios of similarity scores between the input and two different classes. The approach presumes the feature representation and the classifiers are both fixed. The patent teaches only how to select the threshold for the ratio of scores, which is used to classify an input as being from an unknown class.
Recent advances in classification use deep networks and machine learning to determine better features for classification, e.g., U.S. Pat. No. 9,730,643, issued Aug. 15, 2017, to Bogdan Georgescu, Yefeng Zheng, Hien Nguyen, Vivek Kumar Singh, Dorin Comaniciu, and David Liu entitled “Method and system for anatomical object detection using marginal space deep neural networks.” and U.S. Pat. No. 9,965,717, issued May 8, 2018, to Zhaowen Wang, Xianming Liu, Hailin Jin, and Chen Fang entitled “Learning image representation by distilling from multi-task networks.” Neither has an effective approach to address unknown inputs.
Training deep networks with standard loss functions produces representations that separate the known classes well. However, because such networks were not designed to transform unknown inputs to any particular location, unknown inputs will generally be transformed into features that overlap with the known classes, see
The #829 patent above was an example of an ad-hoc approach for addressing unknown inputs with traditional features: adding an additional background or garbage class, explicitly trained on data from b′, to represent unknowns as just another class in the system, and then considering as unknown anything close to the background class. Such an approach can also be used with deep networks to learn features that better separate the background class from the known classes. For example, U.S. Pat. No. 10,289,910, issued May 14, 2019, to Chen et al., entitled “System and method for performing real-time video object recognition utilizing convolutional neural networks,” includes training a background class to improve system robustness. While an ad-hoc approach of training a background class can improve robustness, there are infinitely many potential unknowns, and the background class cannot sample them well. Furthermore, when treated as a normal class, the background class can only be adjacent to a small number of the known classes. Therefore, unknown inputs that are more similar to the non-adjacent classes cannot easily map to the background class. Thus, when unknown inputs are presented to the system, they will still frequently overlap with the known classes, see
Techniques have been developed that more formally address the rejection of samples x∈u, for example, see Abhijit Bendale and Terrance E. Boult: “Towards open set deep networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563-1572, 2016, referred to herein as OpenMax. However, that approach just uses the deep features of the known classes, has no representation of unknown inputs, and has no way to improve the robustness to unknowns. If the deep features of unknown inputs overlap those of the known classes, as seen in
One of the limitations of the background class is that it requires the features of all unknown samples to be in one region of feature space, independent of the similarity of those samples to the known classes. An important question not addressed in prior work is whether there exists a better and simpler representation, especially one that is more effective for low false-accept performance on unknown inputs.
What is needed is a multi-class recognition system that can explicitly reason about unknown inputs, and that improves its performance when given more examples of classes that are not of interest. These needs are addressed by the solutions put forth in the next section.
It is an object of this invention to develop a multi-class classification system that is robust to unknown inputs, and that can improve its performance using added examples.
Still other objects and advantages of the invention will in part be obvious and will in part be apparent from the specification and drawings.
In order to overcome the problem of unknown inputs, the invention develops a classification system that uses an explicit representation of unknown classes, which can be near all known classes in input space, and develops a method of training such a classification system so as to learn feature representations that send most unknowns near the desired explicit representation while keeping known classes farther away from that representation.
The invention accordingly includes training with a mixture of known data (x∈c′) and known unknown data (x∈b′) using a robust loss function that treats known inputs and unknown inputs separately, such that reducing the robust loss drives the system to learn to transform known inputs to representations separate from the other classes and to transform the unknown samples to a desired representation, e.g., the origin or the average of the representations of the known classes. The classification system can use the learned transformations and representations to compute the similarity to known classes and the dissimilarity to the desired location for unknown classes. The robust classification system can be implemented as a set of instructions stored in a non-transitory computer storage medium and executed on one or more general-purpose or specialized processors.
The apparatus embodying features of multiple embodiments, combinations of elements and arrangement of parts that are adapted to effect such steps, are exemplified in the following detailed disclosure, and the scope of the invention will be indicated in the claims. While there are many potential embodiments, we begin with a description of the preferred embodiments using deep networks which provide the current state of the art in many classification problems. While the invention is far more general, we discuss it from the deep network point of view to provide a more coherent presentation, then discuss alternative embodiments afterwards.
For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:
As mentioned, some of the inherent problems with using a traditional network are shown in
In
However, in one embodiment of the present invention, we intentionally train the network to respond only to known inputs in a spatial region that reaches to the origin and to transform known unknown inputs to the origin, see
While one cannot anticipate all unknown unknowns, the many embodiments of the present invention have the advantage that, because the network is trained to respond only to known inputs, unknown samples should produce little to no response, increasing the probability that when an unknown from u is encountered the system will not respond.
We now develop the underlying theory for two embodiments, where for x∈b′ we maximize the entropy of the softmax scores and reduce the deep feature magnitude (∥F(x)∥), separating the unknown samples from the known samples. This allows the network to have unknowns that share features with known classes, as long as the response is small, and may allow the network to focus its learning capacity on responding to the known classes. We do this using two embodiments of a robust loss function, which can be used separately or combined. After reviewing the mathematical derivations of the robust loss functions, we return to describing systems that can use these robust loss functions to develop classifiers that are robust to unknown inputs.
First, we introduce the Entropic Open-Set Loss to make the softmax responses of unknown samples uniform. Second, we expand this loss into the Objectosphere Loss, which requires the samples of c′ to have a magnitude above a specified minimum while driving the magnitude of the features of samples from b′ to zero, providing a margin in both magnitude and entropy between known and unknown samples.
In the following, for classes c∈{1, . . . , C} let Sc(x) be the standard softmax score for class c with

    Sc(x) = exp(lc(x)) / Σc′=1…C exp(lc′(x)),

where lc(x) represents the logit value for class c. Let F(x) be the deep feature representation from the fully connected layer that feeds into the logits. For brevity, we do not show the dependency on input x when it is obvious.
In deep networks, the most commonly used loss function is the standard softmax (cross-entropy) loss based on the scores given above. While we keep the softmax loss calculation untouched for samples of c′, we modify it for training with the samples from b′, seeking to equalize their logit values lc, which will result in equal softmax scores Sc. The intuition here is that if an input is unknown, we know nothing about what classes it relates to or what features we want it to have and, hence, we want the maximum entropy distribution of uniform probabilities over the known classes. Let Sc be the softmax score as above; our Entropic Open-Set Loss JE is defined as:

    JE(x) = −log Sc(x)                    if x ∈ c′ and c is the correct class of x,
    JE(x) = −(1/C) Σc=1…C log Sc(x)       if x ∈ b′.        (1)
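Although the invention is not tied to any particular software framework, the following is a minimal sketch of how this loss could be computed; it assumes a PyTorch-style framework and the illustrative convention that known unknown samples from b′ are labeled with −1.

```python
import torch
import torch.nn.functional as F


def entropic_openset_loss(logits, targets, unknown_label=-1):
    """Sketch of the Entropic Open-Set loss of Equation (1).

    logits:  tensor of shape (batch, C) holding the logit values lc(x).
    targets: tensor of shape (batch,) holding the class index for samples
             from c' and `unknown_label` for known unknown samples from b'.
    """
    log_softmax = F.log_softmax(logits, dim=1)  # log Sc(x)
    known = targets != unknown_label
    loss = torch.zeros(logits.shape[0], device=logits.device)

    # Samples from known classes keep the standard softmax (cross-entropy) loss.
    if known.any():
        loss[known] = F.nll_loss(log_softmax[known], targets[known],
                                 reduction="none")

    # Known unknown samples: average the negative log-softmax over all C
    # classes, which is minimized when all softmax scores are equal.
    if (~known).any():
        loss[~known] = -log_softmax[~known].mean(dim=1)

    return loss.mean()
```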
We now show that the minimum of the loss JE for a sample x∈b′ is achieved when the softmax scores Sc(x) for all known classes are identical.
For an input x∈b′, the loss JE(x) is minimized when all softmax responses Sc(x) are equal: ∀c∈{1, . . . , C}: Sc(x) = 1/C.
For x∈b′ the loss JE(x) is similar in form to entropy over the per-class softmax scores. Thus, based on Shannon's entropy theory, it should be intuitive that the term is minimized when all values are equal. JE(x) is not exactly entropy, so a formal proof is given in the supplementary material.
When the logit values are equal, the loss JE(x) is minimized. This follows since, if the logits are equal, say lc=η, then each softmax score has an equivalent numerator (e^η) and, hence, all softmax scores are equal.
While the above analysis shows that the system minimizes the loss and maximizes entropy, this minimization is at the layer of the logits in the system. One may be interested in the behavior at deeper levels of the network. For networks whose logit layer does not have bias terms, and for x∈b′, the loss JE(x) is minimized when the deep feature that feeds into the logits is the zero vector, at which point the softmax responses Sc(x) are equal: ∀c∈{1, . . . , C}: Sc(x) = 1/C,
and the softmax and deep feature entropy is maximized. To see this, let F∈ℝM be our deep feature vector, and Wc∈ℝM be the weights in the layer that connects F to the logit lc. Since the network does not have bias terms, lc=Wc·F, so when F = 0⃗, the logits are all equal to zero: ∀c: lc=0. As we saw above, when the logits are all equal, the loss JE(x) is minimized, and the softmax scores are equal and maximize entropy.
While we show that at least one minimum exists when the deep feature at that layer satisfies F = 0⃗, the analysis does not show that F = 0⃗ is the only minimum, because it is possible there is a subspace of the feature space that is orthogonal to all Wc. Minimizing the loss JE(x) may, but does not have to, result in a small magnitude on unknown inputs.
As shown in the accompanying magnitude histograms, the magnitudes of the unknown samples are generally lower than the magnitudes of the known samples for a typical deep network. This shows that deep networks trained using the above loss function actually know what they do not know. Using our novel Entropic Open-Set loss of Equation (1), we are able to decrease the magnitudes of unknown samples further. For this particular example, using the embodiment with our Objectosphere loss of Equation (2), we are able to create an even better separation between known and unknown samples.
Following the above analysis, the Entropic Open-Set loss produces a network that generally represents the unknown samples with very low magnitudes, as can be seen in the magnitude histograms, while also producing high softmax entropy. However, there is often some overlap between the feature magnitudes of known samples c and unknown samples u. This should not be surprising, as nothing forces known samples to have a large feature magnitude or unknown samples to always have a small feature magnitude. Seeking a network with a large response to known inputs and no response to unknown inputs, we attempt to put a distance margin between them. In particular, we seek to push known samples into what we call the Objectosphere, where they have large feature magnitude and low entropy; i.e., we are training the network to have a large response to known classes. Also, we penalize ∥F(x)∥ for x∈b′, to minimize the feature magnitude and maximize entropy, with the goal of producing a network that does not respond strongly to anything other than the known class samples. Targeting the deep feature layer helps ensure there are no accidental minima. To formalize this, the Objectosphere loss is calculated as:

    JR(x) = JE(x) + λ·max(ξ − ∥F(x)∥, 0)²    if x ∈ c′,
    JR(x) = JE(x) + λ·∥F(x)∥²                 if x ∈ b′,        (2)

where ξ is the required minimum feature magnitude for known classes and λ weights the magnitude term relative to the entropic term.
Note this penalizes the known classes if their feature magnitude is inside the boundary of the Objectosphere, and penalizes unknown classes if their magnitude is greater than zero. We now show this has only one minimum.
For networks whose logit layer does not have bias terms, given a known unknown input x∈b′, the loss JR(x) is minimized if and only if the deep feature F = 0⃗, which in turn ensures the softmax responses Sc(x) are equal: ∀c∈{1, . . . , C}: Sc(x) = 1/C, maximizing entropy. The "if" follows directly from the analysis above combined with the fact that adding 0 does not change the minimum; given F = 0⃗, the logits are zero, and the softmax scores must be equal. For the "only if", observe that among all features satisfying (Wc·F)=0 for c=1, . . . , C, which minimize JE, the added ∥F(x)∥² term ensures that the only minimum is at F = 0⃗.
The parameter ξ sets the margin, but also implicitly increases scaling and can impact the learning rate; in practice, one can determine ξ using cross-class validation. Note that larger ξ values will generally scale up the deep features, including those of the unknown samples, but what matters is the overall separation. As seen in the histogram plots, the Objectosphere loss provides an improved separation in feature magnitudes compared to the Entropic Open-Set loss.
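For completeness, a corresponding sketch of the Objectosphere loss of Equation (2) is given below, reusing the entropic_openset_loss sketch above; the weighting parameter lam is an assumed, problem-dependent hyperparameter rather than something prescribed by the invention.

```python
import torch


def objectosphere_loss(logits, features, targets, xi, lam, unknown_label=-1):
    """Sketch of the Objectosphere loss of Equation (2).

    features: deep feature vectors F(x), shape (batch, M), from the layer
              that feeds into the logits.
    xi:       minimum feature magnitude required of known classes.
    lam:      weight of the magnitude term relative to the entropic term.
    """
    entropic = entropic_openset_loss(logits, targets, unknown_label)

    magnitudes = features.norm(p=2, dim=1)  # ||F(x)||
    known = targets != unknown_label

    # Known classes are penalized only while their magnitude is below xi ...
    known_term = torch.clamp(xi - magnitudes, min=0.0) ** 2
    # ... while known unknowns are penalized for any non-zero magnitude.
    unknown_term = magnitudes ** 2

    return entropic + lam * torch.where(known, known_term, unknown_term).mean()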
Finally, in yet another embodiment, we can combine the magnitude with a per-class score from softmax and use the number of feature dimensions to help decide when to do so. For low-dimensional problems, after training with the Objectosphere loss we have already trained the network to send unknowns to the origin, where they have nearly identical scores, and so we can report/threshold just the final softmax Sc(x). When the feature dimension is large, we use what we call Scaled-Objectosphere scoring, Sc(x)·∥F(x)∥, i.e., we explicitly scale by the deep feature magnitude. Experimental evaluations show that Scaled-Objectosphere scoring performs about the same on small problems but better for high-dimensional feature representations.
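A sketch of this scoring rule, under the same assumptions as the loss sketches above, is:

```python
import torch


def scaled_objectosphere_scores(logits, features):
    """Per-class scores Sc(x) * ||F(x)|| for thresholding when the deep
    feature dimension is large."""
    probabilities = torch.softmax(logits, dim=1)
    magnitudes = features.norm(p=2, dim=1, keepdim=True)
    return probabilities * magnitudes
```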
To highlight the usefulness of the present inventions, we evaluate various embodiments that are built using deep networks and compare them to standard methods. For evaluation, we split the test samples into c (samples from known classes) and a (samples from unknown classes). Let θ be a probability threshold. For samples from c, we calculate the Correct Classification Rate (CCR) as the fraction of the samples where the correct class c* has the maximum probability and that probability is greater than θ. We compute the False Positive Rate (FPR) as the fraction of samples from a that are classified as any known class c=1, . . . , C with a probability greater than θ.
Finally, we plot CCR versus FPR, varying the probability threshold from θ=1 on the left side to θ=0 on the right side. For θ=0, the CCR is identical to the closed-set classification accuracy. When the classification is performed in combination with detectors that produce different numbers of background samples, normalizing the FPR by the algorithm-specific number of unknown samples in a might be misleading, and it is better to use the raw number of false positives on the x-axis.
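The following sketch illustrates this evaluation with NumPy; the array names and the convention of marking unknown test samples with a label of −1 are illustrative assumptions.

```python
import numpy as np


def ccr_fpr_curve(max_scores, predictions, labels, unknown_label=-1):
    """Correct Classification Rate vs. False Positive Rate.

    max_scores:  highest per-class probability for each test sample.
    predictions: class index achieving that probability.
    labels:      ground-truth class index, or `unknown_label` for samples
                 from unknown classes.
    """
    known = labels != unknown_label
    unknown = ~known
    ccr, fpr = [], []
    for theta in np.linspace(1.0, 0.0, num=101):
        # CCR: fraction of known samples whose correct class has the maximum
        # probability and that probability exceeds theta.
        correct = known & (predictions == labels) & (max_scores > theta)
        ccr.append(correct.sum() / known.sum())
        # FPR: fraction of unknown samples accepted as any known class with
        # probability exceeding theta.
        accepted = unknown & (max_scores > theta)
        fpr.append(accepted.sum() / unknown.sum())
    return np.array(fpr), np.array(ccr)
```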
The first experimental setup uses LeNet++ (Yandong Wen et al.: “A discriminative feature learning approach for deep face recognition.” European Conference on Computer Vision. Springer, Cham, 2016.) on the MNIST Dataset (Yann LeCun: “The MNIST database of handwritten digits.” http://yann.lecun.com/exdb/mnist (1998)), which was also used in
The new algorithms significantly outperform the recent state-of-the-art OpenMax. In Tab. 1, we show that, as designed, the new algorithms do increase entropy and decrease magnitude for unknown inputs. We also tested the same trained network with different sets of unknowns u, including letters from the Devanagari script and unrelated images from CIFAR-10. We summarize the corresponding Correct Classification Rates (CCR) at various False Positive Rate (FPR) values in Tab. 2. In each case, one of the new approaches is the best, and in the 2D feature space of these examples there is not a significant difference between the two new approaches.
Our second set of experiments shows that our loss is also applicable to other architectures. We created a custom protocol using the CIFAR-10 and CIFAR-100 datasets. We train a ResNet-18 architecture to classify the ten classes from CIFAR-10, i.e., the CIFAR-10 classes are our known samples c. Our background class b′ consists of all the samples from CIFAR-100 that contain any of the vehicle classes. We use 4500 samples from the remaining CIFAR-100 classes as a, i.e., the unknown samples. We also test using 26,032 samples of Street View House Numbers (SVHN) (Yuval Netzer et al.: “The Street View House Numbers (SVHN) Dataset.” Accessed 2016 Oct. 1. [Online] http://ufldl.stanford.edu/housenumbers.) as a. With the 1024-dimensional features of the ResNet, the scaling by feature magnitude provides a noticeable improvement. This highlights the importance of minimizing the deep feature magnitude and using the magnitude margin for separation. The results are also shown in Tab. 2, and while using a background class does better than the Entropic Open-Set and Objectosphere losses at very low FPR, Scaled-Objectosphere is the best.
[Tab. 2: Correct Classification Rates at various False Positive Rates for the evaluated approaches and unknown test sets; the tabulated values are omitted here because the table layout could not be recovered from the source.]
The present invention can be viewed as a system for robust transformation of input data into classes, i.e., robust classification. It can also be viewed as a method of transforming a classification system so as to improve its robustness to unknown inputs. We describe embodiments of both views.
The preferred embodiment of the method of transforming a classification system so as to improve its robustness to unknown inputs is summarized in
In
The previous discussion presented embodiments with our novel robust loss functions, which help to transform the network training process to produce a network that provides a far more robust transformation of the input data into deep representations or decisions. While the previous sections focused on explaining the core novelty and reducing the embodiments to practice in isolation, there is substantial value in combining these ideas with existing inventions to provide improved systems and methods for machine learning that are robust to the unknown items that occur in real systems. This is a space of problems for which there are many related patents and patent applications, but none that provides robustness via increasing entropy on unknown inputs or reducing the magnitude of deep features. The present invention is entirely compatible with a wide range of related inventions such as:
While we have presented and evaluated a few preferred embodiments of these new inventions, there is a wide range of embodiments that capture the core concept, which we briefly review. Those skilled in the art will see how many variations can be applied in keeping with the core elements of the invention: putting the unknowns near all known classes, increasing entropy for known unknowns during training, and potentially limiting the magnitude of deep features.
A range of embodiments can be obtained by modifying the training for known classes of interest x∈c′. For example, one can extend Equation (1) by replacing the softmax loss for x∈c′ (the known training samples) with any of the many known loss functions, such as L1 loss, L2 loss, expectation loss, log loss, hinge loss, Tanimoto loss, center loss, or powers of loss functions (squared, cubed). Even more novel loss functions, such as those based on human perception (U.S. Pat. No. 9,792,532, issued Oct. 17, 2017, to David Cox, Walter Scheirer, Samuel Anthony, and Ken Nakayama, entitled “Systems and methods for machine learning enhanced by human measurements”), could be used. Similarly, the loss function for x∈c′ in Equation (2) could use any added penalty that pushes ∥F(x)∥ away from zero. Changing the loss function for the knowns does not impact the novelty and usefulness of the proposed invention in handling unknown inputs but, for some problems, might provide increased accuracy for particular classes.
In the previous embodiments, the “unknown class” is structured to be at the origin, and the known classes are pushed away from it, with the softmax loss term separating them in the other dimensions. One issue to consider with a different loss is where to place the desired representation for the unknown inputs. For example, when using a center loss, it is possible that the classes will not be symmetric about the origin, so a better desired representation for the unknown inputs would be the average of the class centers. This would ensure the unknowns were near each class. In another embodiment, rather than having known classes just be a minimum distance from the origin (i.e., the unknown class), one could have a target location for each class where each class is at least a minimum distance not only from the origin but also from the nearest other class. One embodiment could do this by extending Equation (2) with a term that considers the magnitude ∥F(x)−F(x′)∥, where x′ is the closest point from another class or the center point of the closest class. Those skilled in the art will see how to use the core aspects of the invention, namely the unknowns being near each known class and having high entropy, as a guiding principle to select the desired representation to use when combined with any particular loss function on the knowns.
Another class of alternative embodiments can be obtained by modifying the training for known unknown classes x∈b′ in ways that still enforce high entropy for the unknown samples. For example, one can extend Equation (1) by replacing the average of the log-softmax values for x∈b′ with any other function that increases entropy across the known classes, e.g., a true entropy measure (−Σc pc log pc) or the KL divergence from a uniform or other known prior distribution. Another alternative, which follows from the analysis given, would be to have a loss for x∈b′ that forces a small deep feature magnitude ∥F(x)∥≪1, which we have shown induces high entropy.
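As one illustrative example of such a replacement (an assumption for exposition, not a preferred embodiment), the known unknown term of Equation (1) could be swapped for the KL divergence between the softmax distribution and the uniform distribution over the C known classes, which is likewise zero exactly when all softmax scores are equal:

```python
import math

import torch
import torch.nn.functional as F


def uniform_kl_unknown_loss(logits):
    """KL(softmax(logits) || uniform); minimized (at zero) when all softmax
    scores are equal, i.e., at maximum entropy."""
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    num_classes = logits.shape[1]
    # KL(p || u) = sum_c p_c (log p_c - log(1/C)) = sum_c p_c log p_c + log C
    return (probs * log_probs).sum(dim=1).mean() + math.log(num_classes)
```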
While the above described the use of a single ξ parameter as the minimum feature magnitude for “known” classes, and a target of zero for the feature magnitude of unknown classes, all that really matters is forcing them to be separated. In some problems, it is natural to use a more general measure, not just a binary separation; e.g., a face that is very blurry or very small may clearly be a face, but the actual identity of the subject might be unknown. In an embodiment to address such a problem, the system would use multiple different parameters ξ1, ξ2, . . . , ξn as the target goals for different “confidence” levels, e.g., ξ1=100 for very high-confidence targets with high resolution, ξ2=50 for targets of medium quality/resolution, ξ3=10 for very blurry or very noisy targets, and ξ4=0 for unknown targets.
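A sketch of one way such multiple targets might enter the magnitude term of the loss is shown below; the per-sample target magnitudes (e.g., ξ1, . . . , ξn and 0 for unknowns) are supplied with the training data, and driving the magnitude toward the target, rather than merely above it, is one of several possible design choices.

```python
import torch


def multi_level_magnitude_loss(features, target_magnitudes):
    """Drive each sample's deep feature magnitude ||F(x)|| toward its
    per-sample target (e.g., 100 for high-confidence knowns, ..., 0 for
    unknown targets)."""
    magnitudes = features.norm(p=2, dim=1)
    return ((magnitudes - target_magnitudes) ** 2).mean()
```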
Another embodiment would directly use a user-supplied confidence measure for each input. In such an embodiment, the final magnitude would be an approximate measure of the confidence of the network's prediction that the input is from a known class.
While we have described the loss functions at a general level, the important transformation of the network occurs as these types of loss functions are applied during network training. These losses induce a transformation of the network weights, and embodiments might apply them when training a network from scratch, when fine-tuning all weights of a previously trained network, or when adding one or more layers to an existing network and training only a subset of the weights.
While the above described embodiments from the view of deep learning network architectures, the invention's concepts can be applied to any classification system with learnable weights; e.g., it could be applied to a classic bag-of-words representation with SIFT features in images or n-grams in text, which are then combined with weighted combinations of bags. In such a system, the system might learn the appropriate weights to keep the unknowns at a location between all the known classes.
Various embodiments of systems based on the current invention are shown in
A minimal system is shown in
A more extensive and adaptive system is shown in
One embodiment for system operation during testing is shown in
In a system, these transformation methods and loss computations,
It will thus be seen that the objects set forth above, among those made apparent from the preceding description, are efficiently attained and, because certain changes may be made in carrying out the above method and in the construction(s) set forth without departing from the spirit and scope of the invention, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
It is also to be understood that the following claims are intended to cover all of the generic and specific features of the invention herein described and all statements of the scope of the invention which, as a matter of language, might be said to fall between.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/685,963, entitled “Systems and methods for network learning robust to unknown inputs,” filed Jun. 16, 2019, the contents of which are incorporated herein by reference.