The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 20174435.6 filed on May 13, 2020, which is expressly incorporated herein by reference in its entirety.
The present invention relates to a system and computer-implemented method for training a machine learnable model for classification of objects in spatial data, such as image data. The present invention further relates to a system and computer-implemented method for classifying objects in spatial data using a machine learned model. The present invention further relates to a computer-readable medium comprising data representing instructions for a processor system to perform either or both computer-implemented methods.
Object classification is a key enabler in various technical application areas, such as autonomous driving in which object classification may be used to detect and recognize objects in sensor data acquired by one or more sensors integrated into a vehicle. Such classification may also be referred to as 'recognition' since the classification involves recognizing a class of object. For example, other road users (cars, pedestrians, etc.) or traffic signs may be recognized in image data acquired by a camera sensor. In some cases, the output of the object classification may be used to control actuators, such as actuators involved in the steering and/or braking of the vehicle, e.g., to autonomously avoid collisions with other road users. There are various other uses for object classification, for example in robotics where it may enable a robot to recognize objects in its environment and act accordingly, but also for enabling character recognition, pose recognition, image search, etc.
In general, objects may be classified in image data, e.g., as acquired by a camera, but also in other types of spatial data acquired by other types of sensors. A specific example is the classification of cars and pedestrians in LiDAR data. In general, the spatial data may be two-dimensional, three-dimensional or even higher-dimensional spatial data.
Conventionally, machine learning techniques are used for object classification. This typically involves training a machine learnable model, e.g., a neural network, with a large set of training data which resembles the type of sensor data to which the machine learnable model is to be applied. So-called deep neural networks (DNNs) represent the current state of the art in image classification, see, e.g., references [1]-[4]. They learn image classification by learning to extract discriminative features of an image which are highly optimized for the classification task given the training data at hand. However, these feature extractors are known for their inability to disentangle factors which are independent per se but which may jointly define an object class. Rather, they often take a 'shortcut' by focusing only on individual characteristics which do not uniquely define the object class. This may for example result in a trained DNN recognizing an object by texture but not by shape. Another example is that a trained DNN may focus only on content information, such as texture, colors, edges, while ignoring location information, such as absolute and/or relative locations of parts of the object.
This may lead to trained DNNs generating highly confident yet erroneous predictions on instances of spatial data where so-called biased factors (e.g., the aforementioned content information) are absent or in which mainly those factors are present (e.g., the aforementioned location information) which the DNN has effectively learned to ignore. This may in particular occur if a trained DNN is applied to instances of spatial data which lie outside of the distribution of the training data, e.g., represent previously unseen outliers. However, such outliers may be unavoidable in many real-world applications.
It is conventional to greatly increase the amount of training data such that more possible instances of an object in spatial data are encountered during training, thereby reducing the chance of encountering outliers after deployment of the trained DNN. However, there are limits to such increases in training data, both practical and inherent, since in many real-world applications there will remain a possibility of encountering outliers.
It would be desirable to be able to train a machine learnable model, and to use a machine learned model, for classification of objects in spatial data, which is more robust against outliers and which may be trained without excessive amounts of training data.
In accordance with a first aspect of the present invention, a computer-implemented method and corresponding system are provided for training a machine learnable model for classification of objects in spatial data. In accordance with a further aspect of the present invention, a computer-implemented method and corresponding system are provided for using the machine learned model for classification of objects in spatial data. In accordance with a further aspect of the present invention, a computer-readable medium is provided comprising instructions for causing a processor system to perform any of the methods disclosed herein.
The above measures are based on the insight that there may be two fundamental factors defining an object class in spatial data, namely the presence of parts composing the object and their spatial arrangement. The presence of parts composing the object may be understood as a ‘content’ factor and be represented in the spatial data itself by the presence of certain texture, colors, edges, etc. The spatial arrangement of the parts may be referred to as a ‘location’ factor and may be represented in the spatial data by the parts of the object having a certain relative spatial arrangement, and in some cases the parts having certain absolute spatial positions. The inventors have recognized that it may be problematic if the machine learnable model focuses only on one of the two factors, e.g., only on the content or only on the location, or if one factor greatly outweighs the other in the decision-making of the machine learned model. Namely, this may make the machine learned model susceptible to outliers in which only one of the two factors is present in the spatial data, or in which one of the two factors has a characteristic which matches an object in the training data (‘inlier’) whereas the other factor has an entirely different characteristic (other ‘inlier’ class, or ‘outlier’), as will also be illustrated elsewhere with reference to
This may result in the machine learned model providing highly confident but erroneous classifications of objects, which may be highly problematic in real-life applications. For example, an autonomous driving system may recognize a yield sign in a triangular building feature, which may cause the vehicle to inadvertently slow down or suddenly brake.
The above measures address these problems by providing two separate classification parts: one which is focused on classifying objects based on content information and one which is focused on classifying objects based on location information. For that purpose, two separate classification parts may be provided in the machine learnable model, which may then be trained specifically on content information or on location information.
More specifically, the machine learnable model may comprise a convolutional part which may generate one or more feature maps from an instance of spatial data. Such convolutional parts are conventional, and may comprise one or more convolutional layers but also other types of layers, such as pooling layers. The one or more feature maps may represent an ‘activation’ of the convolutional part, which may, like the spatial data from which they were generated, themselves contain content information and location information. For example, the feature maps may have spatial dimensions which may represent the spatial dimensions of the spatial data, while activations in a feature map at a particular location may represent an occurrence of a feature (e.g., particular content) at the particular location.
The content classification part may be trained on a version of the one or more feature maps from which location information was removed. This version may also be referred to as a content information-specific feature map and may omit, or at least greatly reduce in terms of significance, any location information. The content classification part may thus be trained to be sensitive to content information and invariant to location information, and may elsewhere also be referred to as ‘content sensitive classifier’ (CSC). Likewise, the location classification part may be trained on a version of the one or more feature maps from which content information was removed. This version may also be referred to as a location information-specific feature map and may omit, or at least greatly reduce in terms of significance, any content information. The location classification part may thus be trained to be sensitive to location information and invariant to content information, and may also be referred to as a ‘location sensitive classifier’ (LSC). Such separation of the classification based on both factors may avoid the overall classifier being inadvertently biased on only one of the two factors and providing highly confident predictions, e.g., indicating the presence of one object class, even if the other factor is missing or indicates a different object class.
Nevertheless, the separation of the classification based on both factors may not fully diminish the susceptibility of the machine learned model to outliers. Therefore, as part of the machine learnable model, an outlier detection part may be provided which may be trained during the training to detect outliers. However, to avoid having to provide explicit outliers in the training data, and thereby having to increase the size of the training data, the outlier detection part may be trained on inlier examples while the presence of actual outliers in the input data of the machine learnable model may be mimicked during training. For that purpose, one or more of the previously generated feature maps, e.g., as generated by the convolutional part or as generated for the content classification part or the location classification part, may be modified so as to mimic the presence of outliers. Thereby, it may not be necessary to modify the spatial data itself, in which it may be difficult to create realistic outliers, but rather such outliers may be generated in the feature maps during training; namely, it may be easier to generate spatial outliers in the machine learnable model itself by feature manipulation compared to modifying actual input data.
Effectively, the machine learnable model may provide a content sensitive classifier, a location sensitive classifier and at least one outlier detector. The combination of these three parts may provide a more robust classification of objects in spatial data with respect to outliers, without having to increase the size of the training data. Namely, the object classification may now be performed separately with respect to content and location, which may allow a more informed decision to be taken on the object classification. For example, if the content classification part indicates the presence of an object in class A while the location classification part indicates the presence of an object in class B, it may be decided that the actual object may represent an outlier with respect to the training data and thus that the overall classification may be unreliable. A previous image classifier may simply indicate the presence of class A if the classifier was inadvertently biased on content information, or the presence of class B if the classifier was biased on location information. The separate classification based on content and location may thereby contribute to the explainability and interpretability of the classification, which may be of increasing importance as machine learnable models grow more complex and are used in more safety-conscious applications.
More generally speaking, both classifiers may jointly distinguish between objects in spatial data being in-distribution or marginal-out-of-distribution. Namely, in the former case, the classification output of both classifiers may be in agreement, while in the latter case, the classification output of both classifiers may be in disagreement.
To be able to even better detect and distinguish outliers, e.g., to detect and distinguish more types of outliers and/or detect outliers with greater certainty, the outlier detector may explicitly indicate whether an outlier is present or not. As such, a decision on a classification in an object class may be taken only if both classification parts classify the input instance into a same class while the outlier detector indicates that the input instance does not represent an outlier. Thereby, classification of objects may be made dependent on an agreement between the individual classifiers and the at least one outlier detector. If such agreement is not reached, a potential object may not be classified, which may in some cases involve the potential object being simply ignored, or the object may be classified as an unknown object, or another type of action may be taken. This may provide a machine learned model for image classification which may allow explainable and interpretable decision-making. The respective outputs of the classifiers and outlier detector(s) may together provide such explanations, as will also be elucidated with reference to various tables in the detailed description of this specification.
The above measures may thus empower a classification network to learn a feature representation which is specifically sensitive towards the content and location factors, enabling the network to detect an out-of-distribution sample and provide an explanation of its nature. Hence, outliers sharing common content and location factors with in-distribution samples can still be identified, even if the factors are blended from different classes. Crucially, the classification network may be trained only on in-distribution samples yet provide explicit cues for interpretable decision-making on previously unseen outliers.
Optionally, the machine learnable model comprises a location-and-content outlier detection part, and the method comprises generating the pseudo outlier feature map for the location-and-content outlier detection part by modifying feature information which is contained in the one or more previously generated feature maps and which is associated with both the location information and the content information. As one of the outlier detectors, or as the only outlier detector, the machine learnable model may comprise a location-and-content outlier detection part, which may in the following also be referred to as an outlier detection for joint marginal distribution ('ODJ'), referring to this outlier detector being trained to detect samples which are inliers in both the location marginal distribution of the training data and the content marginal distribution of the training data, but of which the association between spatial and content features is incorrect. In other words, this outlier detector may detect samples which are outliers with respect to the joint marginal distribution of location and content of the classes which were present in the training data. For that purpose, a pseudo outlier feature map may be generated by modifying the feature information of one or more previously generated feature maps which contain location information and content information. For example, the one or more feature maps generated by the convolutional part may be modified to mimic the presence of actual outliers in the input data of the machine learnable model. The outlier detector may be trained on the thus-generated pseudo outlier feature map(s) to classify the objects represented by the generated pseudo outlier feature map(s) as outliers.
Optionally, the machine learnable model comprises a location outlier detection part, and the method comprises generating the pseudo outlier feature map for the location outlier detection part by modifying feature information which is contained in the one or more previously generated feature maps and which is associated with the location information. As one of the outlier detectors, or as the only outlier detector, the machine learnable model may comprise a location outlier detection part, which may in the following also be referred to as an outlier detection for location marginal ('ODL'), referring to the outlier detector being trained to detect samples which are outliers in the location marginal distribution of the training data. For that purpose, a pseudo outlier feature map may be generated by modifying the feature information of one or more previously generated feature maps which contain location information. In some examples, feature map(s) may be modified which contain predominantly or only location information. For example, the location information-specific feature map for the location sensitive classifier may be modified to mimic the presence of actual outliers in the input data of the machine learnable model.
Optionally, the location outlier detection part is implemented by the location classification part by providing the pseudo outlier feature map to the location classification part as part of a separate outlier object class to be learned. Instead of implementing the outlier detector for the location marginal as a separate detector, e.g., as a separate head of a deep neural network, this detector may also be implemented by the location sensitive classifier itself, namely by having the location sensitive classifier learn to classify outliers from the pseudo outlier feature map and to categorize such outliers as a separate class, e.g., a class N+1 for a machine learnable model configured to classify N object classes.
Optionally, the machine learnable model comprises a content outlier detection part, and the method comprises generating the pseudo outlier feature map for the content outlier detection part by modifying feature information which is contained in the one or more previously generated feature maps and which is associated with the content information. As one of the outlier detectors, or as the only outlier detector, the machine learnable model may comprise a content outlier detection part, which may in the following also be referred to as an outlier detection for content marginal ('ODC'), referring to the outlier detector being trained to detect samples which are outliers in the content marginal distribution of the training data. For that purpose, a pseudo outlier feature map may be generated by modifying the feature information of one or more previously generated feature maps which contain content information. In some examples, one or more feature maps from the convolutional part may be modified, after which location information may be removed from the modified feature map(s), e.g., by spatial aggregation, to obtain a pseudo outlier feature map which omits location information and which contains predominantly or only content information.
Optionally, each of the one or more feature maps generated by the convolutional part has at least two spatial dimensions associated with the location information, and wherein feature values of the one or more feature maps at each spatial coordinate together form a feature vector representing content information at the respective spatial coordinate, wherein:
The above represent advantageous ways of generating the content information-specific feature map and the location information-specific feature map. In a specific example, if the convolutional part generates an output activation of size H×W×C, with H and W referring to the height and width and thereby to the spatial dimensions and C referring to a number of channels each representing a separate feature (the output activation thereby corresponding to C feature maps each of size H×W×1), the content-specific feature map may be generated by aggregation as a 1×1×C feature map, while the location-specific feature map may be generated by aggregation as a H×W×1 feature map.
Optionally, modifying the one or more previously generated feature maps comprises at least one of:
The above represent advantageous ways of generating pseudo outlier feature maps from feature maps generated for classifiers of the machine learnable model.
Optionally, the machine learnable model is a deep neural network, wherein the convolutional part is a convolutional part of the deep neural network and wherein the content classification part and the location classification part are respective classification heads of the deep neural network. In some embodiments, the respective classifiers may be implemented as different heads of a deep neural network. In some embodiments, at least some of the outlier detectors may also be implemented as respective heads of the deep neural network. In other embodiments, individual classifiers and/or outlier detectors may be implemented as separate neural networks which may be jointly trained on the training data. In this respect, it is noted that the advantageous effects of using a content-sensitive classifier, a location-sensitive classifier and at least one outlier detector may also be obtained by other machine learnable models besides deep neural networks. For example, instead of using deep neural networks for classification and/or outlier detection, the machine learnable model may be based on Bayesian networks, Hidden Markov models, etc.
Optionally, the input interface of the system configured to classify objects in spatial data is a sensor interface to a sensor, wherein the sensor is configured to acquire the spatial data. The system may thus classify objects in sensor data, e.g., in real-time or pseudo real-time. For example, the sensor data in which objects are to be classified may be obtained from an image sensor, a lidar sensor, etc., or from a combination of such sensors.
Optionally, the system is a control system configured to adjust a control parameter based on said classification of the object. For example, the control parameter may be linked to an actuator, and changing the control parameter may change the operation of the actuator. Thereby, the control system may directly or indirectly control the actuator based on the classification results. For example, in a (semi-)autonomous vehicle, the control system may control the braking and/or steering to avoid a detected pedestrian.
It will be appreciated by those skilled in the art that two or more of the above-mentioned embodiments, implementations, and/or optional aspects of the present invention may be combined in any way deemed useful.
Modifications and variations of any system, any computer-implemented method or any computer-readable medium, which correspond to the described modifications and variations of another one of said entities, can be carried out by a person skilled in the art on the basis of the present description.
These and other aspects of the present invention will be apparent from and elucidated further with reference to the embodiments described by way of example in the following description and with reference to the figures.
It should be noted that the figures are purely diagrammatic and not drawn to scale. In the figures, elements which correspond to elements already described may have the same reference numerals.
The following list of reference numbers is provided for facilitating the interpretation of the drawings and shall not be construed as limiting the present invention.
20 sensor
22 camera
40 actuator
42 electric motor
60 physical environment
80 (semi-)autonomous vehicle
100 system for training machine learnable model
160 processor subsystem
180 data storage interface
190 data storage
192 training data
194 data representation of machine learnable model
196 data representation of machine learned model
200 method of training machine learnable model
210 accessing training data for training
220 providing machine learnable model
225 providing outlier detection part
230 training the machine learnable model
240 generating content information-specific feature map
245 training content classification part
250 generating location information-specific feature map
255 training location classification part
260 generating pseudo outlier feature map
265 training outlier detection part
300 object (person) classifiable by content and location information
310 violations of content and location information
320 fruits in facial arrangement of person
330 elements of face with locations permuted
340 randomly shuffled facial elements
400 location (spatial arrangement) marginal
410 content marginal
420 in-distribution sample
430 joint-out-of-distribution sample
440 marginal-out-of-distribution sample
450 full-out-of-distribution sample
500 input instance
510 location sensitive classifier
512 feature aggregation
514 H×W×1 feature map
516 N+1 class classification
520 content sensitive classifier
522 spatial aggregation
524 1×1×C feature map
526 N-class classification
528 outlier detection for content marginal distribution
530 location-and-content outlier detection part
532 flatten
534 H×W×C feature map
536 outlier detection for joint marginal distribution
540 intermediate feature maps
600 intermediate feature maps of inliers
610 class A
620 class B
650 intermediate feature maps of pseudo outliers
660 feature map generated by removal of information
670 feature map generated by random location shuffle
680 feature map generated by content-mixing between classes
690 feature map generated by location swapping
700 system for control or monitoring using machine learned model
720 sensor data interface
722 sensor data
740 actuator interface
742 control data
760 processor subsystem
780 data storage interface
790 data storage
800 method for classifying objects in spatial data
810 accessing machine learned model
820 accessing input data
830 generating feature map(s)
840 generating content information-specific feature map
850 generating location information-specific feature map
860 generating content-based object classification result
870 generating location-based object classification result
880 generating outlier detection result
890 classifying object in spatial data
900 computer-readable medium
910 non-transitory data
The following provides with reference to
As shown in
In some embodiments of the present invention, the data storage 190 may further comprise a data representation 194 of an untrained version of the machine learnable model which may be accessed by the system 100 from the data storage 190. It will be appreciated, however, that the training data 192 and the data representation 194 of the machine learnable model may also each be accessed from a different data storage, e.g., via a different subsystem of the data storage interface 180. Each subsystem may be of a type as is described above for the data storage interface 180. In other embodiments, the data representation 194 of the untrained version of the machine learnable model may be internally generated by the system 100, for example on the basis of design and/or architectural parameters for the machine learnable model, and therefore may not be explicitly stored in the data storage 190.
The system 100 may further comprise a processor subsystem 160 which may be configured to, during operation of the system 100 and as part of the training of the machine learnable model, generate a content information-specific feature map by removing location information from the one or more feature maps and training the content classification part on the content information-specific feature map; generate a location information-specific feature map by removing content information from the one or more feature maps and training the location classification part on the location information-specific feature map; provide, as part of the machine learnable model, at least one outlier detection part for detecting outliers in input data of the machine learnable model which do not fit a distribution of the training data; and as part of the training of the machine learnable model, generate a pseudo outlier feature map by modifying one or more previously generated feature maps which are generated for the instance of the spatial data, to mimic a presence of an actual outlier in the input data of the machine learnable model, and train the outlier detection part on the pseudo outlier feature map. It will be appreciated that these aspects of the operation of the system 100 will be further explained with reference to
The system 100 may further comprise an output interface for outputting a data representation 196 of the trained machine learnable model, this model also being referred to as a machine ‘learned’ model and its data also being referred to as trained model data 196. For example, as also illustrated in
The following describes the classification of objects in spatial data with the example of images. It will be appreciated, however, that the measures described below may also be applied to other types of spatial data which may not directly be considered images.
A standard classification network may be biased towards the spatial locations of elements and thus may also classify the second image 320, showing fruits placed in the facial arrangement of the first image 300, as a face. Alternatively, a standard classification network may be biased towards the elements themselves, e.g., the content, and thus classify the third image 330 with shuffled facial elements as a face. However, for a human observer, these images do not present a challenge. Namely, while humans would, to a certain degree, recognize all three images as faces, they would recognize that only the first image 300 shows a true face and the other images show only certain facial attributes.
The machine learnable/learned model as described in this specification represents a classification framework which may be empowered to mimic a human response. Namely, the classification framework may on the one hand recognize true faces and on the other hand may also detect an outlier face and provide an explanation of its nature, e.g., by containing true facial elements but in false locations. This ability to detect and interpret outliers while performing classification may be of great importance in many applications and for example may be used for active labeling to improve network generalization or for quality-control in manufacturing industries. In particular, the described classification framework may learn the feature representation to be specifically sensitive towards the content and location factors. Besides the standard classification, this may provide additional explanation about image ambiguities in terms of content and location.
In this example, the two classes (faces and fruits) require the content and the location each to fall within a select marginal distribution, with the select marginal distributions then jointly defining the object class. The marginal distributions are schematically indicated along the respective axes as graphs, with the graph 400 defining two marginal distributions in terms of location (spatial arrangement), namely a face-like arrangement and a vertical stack arrangement, and with the graph 410 defining two marginal distributions in terms of content, namely face features (e.g., nose, ears, eyes) and fruit features (e.g., banana, apple, cherry). Throughout this specification, the samples of which both factors fall within the respective marginal distributions and which then jointly define an object class will be referred to as 'in-distribution' samples, see reference numeral 420. These inliers belong to the two possible object classes: faces, in which the content information falls within the 'face feature' marginal distribution and the location information falls within the 'face-like arrangement' marginal distribution, and fruits, in which the content information falls within the 'fruit feature' marginal distribution and the location information falls within the 'vertical stack arrangement' marginal distribution. All other samples may be referred to as 'out-of-distribution' samples or 'outliers', see reference numerals 430-450. The out-of-distribution samples as a group may be further segregated into three distinct parts:
1. The joint-out-of-distribution samples 430 for which both factors are in their marginal distributions but their association is wrong, e.g., a sample having the content of one class but the location of another class (regions with dashed borders in
2. The marginal-out-of-distribution samples 440 for which one factor is out of its marginal distribution and the other is in its marginal distribution (striped regions in
3. The full-out-of-distribution samples 450 for which all factors are out of their marginal distributions (dotted regions in
In general, the structure of the out-distribution space may be more complex, e.g., with more factors besides content and location, but for the sake of simplicity, the two-factor case is considered, which leads to the three types of the outliers described above.
As will also be described with reference to
However, the two classifiers may not by themselves be able to distinguish the other two out-of-distribution types, namely the joint-out-of-distribution and full-out-of-distribution samples. To also distinguish the joint-out-of-distribution case, a separate outlier detector may be provided in the classification framework. To train this outlier detector, hard negative examples may be generated in the feature space by mixing information from different samples. This 'joint' outlier detector may now detect samples for which both the content and the location information represent valid factors of some class, but are wrongly associated. For example, the face in the third image 330 in
The classifiers described above, which may also be referred to as ‘marginal-sensitive’ classifiers, may also provide high-confident classifications even in the full-out-of-distribution case. To address this, a further outlier detector may be added for each of the classifiers. To predict outliers, each outlier detector may be trained in a similar way as the respective classifier, namely by deliberately removing or masking certain information from the internal representation used as input to the respective classifier and by modifying the internal representation to mimic the presence of outliers in the input of the machine learnable model. The output of the outlier detectors may be used to distinguish in the classification outcome of the marginal-sensitive classifiers between inliers and the corresponding out-of-the-marginal outliers. Notably, the machine learnable model incorporating the above described classification framework may be trained using only in-distribution samples.
In some examples of the present invention, the machine learnable model may thus comprise at least two classifiers and three outlier detectors which together provide additional information about content and location factors of an image. More specifically, the output of each of the classifiers and the outlier detectors may be used to differentiate between different types of outliers and provide explanations for ambiguous cases. Such outputs may for example be logged or output to an operator, or may be used to take subsequent decisions. The following briefly summarizes the three different outlier detectors and their output when applied to the different types of input instances. It can be seen that each type of input instance may be uniquely identified by the combination of outputs of the respective outlier detectors:
Content-sensitive classifier (CSC, 520): This classifier's decisions may be based upon the content of object elements/parts independent of their spatial location. Let the dimension of the input feature map Finput 540 of the CSC branch 520 be H×W×C, where H, W and C are the height, width and number of channels of the feature map. Note that the spatial resolution H×W may capture spatial information and the channels C may encode feature representations, and that such an input feature map of H×W×C may also be considered to represent a C-tuple of H×W×1 feature maps. Upon removing the spatial information from the channels, the content-sensitive classifier may be directed to respond to the features encoded in different channels irrespective of their spatial location. For that purpose, the spatial information H×W may be aggregated 522 across the channels C in order to remove the spatial information and allow the classifier to make its decisions based on the feature representation encoded in the spatially aggregated channels Fc 524 of size 1×1×C, which may elsewhere also be referred to as a content information-specific feature map. Spatial aggregation of Finput at channel k may be formulated as: $F^{c}_{k} = \sum_{i=1}^{H}\sum_{j=1}^{W} F^{\mathrm{input}}_{ijk}$.
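By way of non-limiting illustration, the spatial aggregation described above may be sketched as follows in PyTorch; the (batch, channels, height, width) tensor layout and the function name are merely illustrative assumptions and do not limit the present invention.

```python
import torch

def spatial_aggregate(f_input: torch.Tensor) -> torch.Tensor:
    """Spatial aggregation of the CSC branch: sum out H and W so that only the
    per-channel content representation remains.

    f_input: feature map of shape (B, C, H, W)
    returns: content information-specific feature map of shape (B, C, 1, 1)
    """
    return f_input.sum(dim=(2, 3), keepdim=True)
```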
Outlier detection for content marginal (ODC): Given that the CSC 520 may classify an input sample into one of the object classes, it may be desirable to provide an outlier detector to identify outliers of which the content lies outside of the marginal distribution. It may be desired to detect outliers which are unseen during the training while only using the inlier samples from the training data. The following explains how potential outliers may be generated and how the outlier detector may be trained in a self-supervised manner. Namely, hard negative examples of outliers may be generated by augmenting the intermediate feature maps Finput 540 before the spatial aggregation 522. Here, samples with the entire content present may be considered as inliers, whereas samples with absent, incomplete and/or mismatched content may be considered as outliers. Such outliers may be generated in the feature map Finput by removing a part of the information or by blending the content of samples from one class with the content of samples from another class. For example, the blending may be accomplished by replacing a patch of size h×w×C from the feature map Finput (class1) with a patch of the same size from the feature map of a different class Finput (not class1), where h<H and w<W. Information may for example be removed by setting all values in the patch to 0. Such self-generated outliers in the feature space may be referred to as pseudo outliers Fpseudo. This set of outliers may be generated in every training iteration and the outlier detector may be trained on both the inlier and pseudo outlier feature maps. Note that the CSC may be trained solely on the valid inlier training data, and that the pseudo outliers may be used only to train the outlier detector (ODC).
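The generation of such content pseudo outliers may, purely by way of non-limiting example and assuming a (channels, height, width) tensor layout, be sketched as follows; the function name, default patch size and random patch placement are illustrative assumptions.

```python
import torch

def make_content_pseudo_outlier(f_a: torch.Tensor, f_b: torch.Tensor,
                                h: int = 3, w: int = 3, mode: str = "blend") -> torch.Tensor:
    """f_a, f_b: feature maps of shape (C, H, W) taken from two different classes."""
    _, H, W = f_a.shape
    i = torch.randint(0, H - h + 1, (1,)).item()
    j = torch.randint(0, W - w + 1, (1,)).item()
    f_out = f_a.clone()
    if mode == "remove":
        f_out[:, i:i + h, j:j + w] = 0.0                       # remove a part of the information
    else:
        f_out[:, i:i + h, j:j + w] = f_b[:, i:i + h, j:j + w]  # blend in inter-class content
    return f_out
```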
Location-sensitive classifier (LSC, 510): This classifier's decisions may be based on the spatial locations of the object's parts/elements but not their content. Consider the input feature map Finput 540 of the LSC with dimensions H×W×C. The spatial resolution H×W may contain spatial information and the channels C may encode feature representations. It may be desired to capture only the spatial information and thereby possibly discard the feature representations. Similar to the spatial aggregation in the CSC branch 520, feature aggregation 512 may be applied to the intermediate feature map Finput 540 to integrate out the content information, resulting in the feature map Fl 514 with dimensions H×W×1. The feature aggregation at every location i,j in Finput may be formulated as: $F^{l}_{ij} = \sum_{m=1}^{C} F^{\mathrm{input}}_{ijm}$, for $i \in \{1,\dots,H\}$, $j \in \{1,\dots,W\}$. This feature aggregation may weaken the content representation but does not affect the spatial information. The classifier branch with this feature aggregation as a component (branch 510 in
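A corresponding non-limiting sketch of the feature aggregation, using the same assumed (batch, channels, height, width) layout as above:

```python
import torch

def feature_aggregate(f_input: torch.Tensor) -> torch.Tensor:
    """Feature aggregation of the LSC branch: sum out the channels so that only
    the spatial arrangement of activations remains.

    f_input: feature map of shape (B, C, H, W)
    returns: location information-specific feature map of shape (B, 1, H, W)
    """
    return f_input.sum(dim=1, keepdim=True)
```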
Outlier detection for location marginal (ODL): This outlier detector may categorize samples with unknown or incomplete spatial locations of object elements as outliers, as they may not correspond to the known location marginal distribution. Similar to the ODC, the outlier detection may be trained in a self-supervised manner. For example, pseudo outlier samples may be generated from the aggregated feature map Fl 514. As mentioned above, potential outliers are samples with an unknown or incomplete spatial arrangement. Such outlier samples may be generated from the feature map Fl either by removing a part of the information or by randomly shuffling locations in the feature map. It was found that the ODL may not need to be implemented as a separate outlier detector but that the outlier detection may be implemented as an additional class in the classification branch 510. It is hypothesized that learning the spatial arrangement may be 'easier' than learning distinct feature representations, and thus that the location outlier detection may be included as an additional class in the classification branch, which is shown in
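A non-limiting sketch of generating such location pseudo outliers by randomly shuffling the spatial locations of the aggregated feature map Fl; the flatten-permute-reshape implementation shown here is merely one possible realization.

```python
import torch

def make_location_pseudo_outlier(f_l: torch.Tensor) -> torch.Tensor:
    """f_l: aggregated feature map of shape (1, H, W); returns a copy in which
    all spatial locations have been randomly permuted."""
    _, H, W = f_l.shape
    perm = torch.randperm(H * W)
    return f_l.reshape(1, -1)[:, perm].reshape(1, H, W)
```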
Outlier detection for joint marginal distribution (ODJ): This outlier detector may treat samples that share similar attributes with the training data as inliers and the rest as outliers. Similar to the ODC, the pseudo outliers may be generated from the intermediate feature map 540 with dimension H×W×C. In the ODC, the pseudo outliers may be subject to spatial aggregation 522, which may result in the loss of spatial information. On the other hand, the ODL may simulate pseudo outliers from the aggregated feature map and may have no cues for the content information. Unlike these two cases, the ODJ may be desired to be sensitive to both the content and the location marginals. Hence, the entire feature map 540 of dimension H×W×C may, after flattening 532, be used as input for outlier detection so that both content and spatial information persist. The augmenting strategy may be similar to that of the ODC, where a part of the information may be removed or information may be blended between inter-class features. In addition to the above strategies, any two locations with patch size h×w×C may also be swapped within the same feature map. This additional example may allow the outlier detector to be sensitive to changes in either of the two marginals.
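A non-limiting sketch of the additional patch-swapping augmentation for the ODJ; the random, possibly overlapping patch placement and the function name are illustrative assumptions.

```python
import torch

def swap_patches(f_input: torch.Tensor, h: int = 3, w: int = 3) -> torch.Tensor:
    """f_input: feature map of shape (C, H, W); returns a copy in which two
    randomly chosen (possibly overlapping) patches of size h x w x C are swapped."""
    _, H, W = f_input.shape
    (i1, j1), (i2, j2) = [(torch.randint(0, H - h + 1, (1,)).item(),
                           torch.randint(0, W - w + 1, (1,)).item()) for _ in range(2)]
    f_out = f_input.clone()
    patch1 = f_out[:, i1:i1 + h, j1:j1 + w].clone()
    f_out[:, i1:i1 + h, j1:j1 + w] = f_out[:, i2:i2 + h, j2:j2 + w]
    f_out[:, i2:i2 + h, j2:j2 + w] = patch1
    return f_out
```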
With respect to the training, it is noted that the inlier examples from the training data and any generated pseudo outliers may be used to train the outlier detectors. All three outlier detectors and the two classifiers may be trained independently of each other, but may be trained in the same training session, e.g., in the same or separate iterations of the training session. The following describes specific examples of training parameters and architecture/network parameters, which are merely exemplary and entirely non-limiting.
In a specific example, both CSC and ODJ may be trained for 100 epochs starting with a learning rate of 0.001, with the learning rate being dropped by a factor of 0.1 after 80 epochs. It was found that the LSC may be trained for a smaller number of epochs than the other branches. In a specific example, it was sufficient to train for only 25 epochs with a learning rate of 0.001. In a specific example, a batch size of 128 and the Adam optimizer with no weight decay may be used. In a specific example, the network weights may be initialized with Xavier initialization. In a specific example, the mean square error loss and the tanh activation function may be used for outlier detection in both CSC and ODJ. Here, the labels may also be flipped with a probability of 0.1 in every training batch, flipping the labels of inlier samples to 'outlier' and of self-generated outlier samples to 'inlier', to avoid the network overfitting on the inlier samples.
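By way of non-limiting illustration, the above training parameters may be set up as follows in PyTorch; the placeholder model and the helper name are merely exemplary assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 3)  # placeholder for the actual classifier branch
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)  # Xavier initialization of the weights

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
# drop the learning rate by a factor of 0.1 after 80 of the 100 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
num_epochs, batch_size = 100, 128

def flip_labels(labels: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Flip inlier (0) / pseudo-outlier (1) labels with probability p per sample."""
    flip = torch.rand_like(labels, dtype=torch.float) < p
    return torch.where(flip, 1 - labels, labels)
```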
In a specific example, for the CSC, training may be started with only the classification loss until epoch 8, with the loss for outlier detection being included thereafter. Such a setting may pretrain the weights and stabilize the training for outlier detection. A classifier similar to the CSC may be used in the ODJ branch to pretrain the weights until epoch 8, after which both the classification loss and the loss for outlier detection may be used to stabilize the rest of the training. The classifier including the outlier class in the LSC may be trained from scratch.
In a specific example, the architecture details of the LSC may be as follows: input1c-conv16c-conv32c-conv64c-FeatureAggregation1c-conv1c-fc4. The kernel sizes of the convolutional layers may be [5,5,3,3], respectively. Here, input1c, conv16c and fc4 refer to an input with 1 channel, a convolutional layer with 16 feature maps and a classifier with 4 classes (3 object classes + 1 outlier class), respectively. In a specific example, the base architecture details of the CSC may be as follows: input1c-conv16c-conv32c-conv64c-conv16c-SpatialAggregation16c-fc512-fc128. The classification and outlier heads are as follows: fc128-fc3 and fc128-fc1. Here, fc3 represents a classifier for 3 object classes and fc1 refers to a single outlier detection neuron. In a specific example, the architecture details of the ODJ may be as follows: input1c-conv16c-conv32c-conv64c-conv4c-reshape(4×16×16)-fc512-fc128-fc1. The classification head providing the loss to stabilize the training of the ODJ may be as follows: fc128-fc3.
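By way of non-limiting illustration, the LSC architecture string above may be realized for example as follows; the paddings, strides and the assumed 16×16 spatial resolution at the fully connected layer are exemplary choices not specified above.

```python
import torch
import torch.nn as nn

class LSC(nn.Module):
    """Location-sensitive classifier sketch: input1c-conv16c-conv32c-conv64c-
    FeatureAggregation1c-conv1c-fc4 with kernel sizes [5, 5, 3, 3]."""
    def __init__(self, in_hw: int = 16, num_classes: int = 4):  # 3 object classes + 1 outlier class
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.post = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        self.fc = nn.Linear(in_hw * in_hw, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.convs(x)                    # (B, 64, H, W)
        f_l = f.sum(dim=1, keepdim=True)     # feature aggregation -> (B, 1, H, W)
        f_l = self.post(f_l)
        return self.fc(f_l.flatten(1))       # N+1 class logits, incl. outlier class

logits = LSC()(torch.randn(2, 1, 16, 16))    # usage example: two 16x16 single-channel inputs
```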
In a specific example, for the generation of outlier samples in all three outlier branches during training, a patch size of either 3×3 or 5×5 may be chosen to remove or mix information from another class or to swap locations within the feature map, as discussed elsewhere in this specification. In a specific example, instead of choosing the entire patch in a random manner, the center of the patch may be selected based on the highest activation location in the feature map. The feature map of size H×W×C may be aggregated along the channels to obtain a feature map of size H×W, which may then be normalized to [0,1] so that it may be treated as a probability map from which a location is picked with probability p. In the CSC and the ODJ, the aggregation of channels may be performed only to pick a location with high activations and need not itself be used as an input to the network, whereas in the LSC the feature map is already aggregated. During the course of training, this scheme may enable the highest-activation locations to be modified, which would potentially result in outliers.
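A non-limiting sketch of selecting the patch center based on the highest-activation locations; the min-max normalization and sampling details used here are merely one possible realization.

```python
import torch

def pick_patch_center(f_input: torch.Tensor):
    """f_input: (C, H, W). Sample a patch center (i, j) proportionally to the
    channel-aggregated, [0, 1]-normalized activation map."""
    act = f_input.sum(dim=0)                                   # aggregate channels -> (H, W)
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)   # normalize to [0, 1]
    probs = act.flatten() + 1e-8                               # avoid an all-zero distribution
    idx = torch.multinomial(probs / probs.sum(), 1).item()
    return idx // act.shape[1], idx % act.shape[1]
```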
Experiments show that the measures described in this specification may allow outliers in the content marginal, location marginal and joint marginal distributions to be detected with high accuracy by training the machine learnable model (substantially) only on the in-distribution samples from the training data. The framework comprising the outlier detectors along with the classifiers may provide explicit cues for interpretable decision-making on unseen outlier samples. The following table illustrates the explicit cues for interpretable decision-making on a few examples. As shown in this table, it is possible to interpret the type of outlier sample from the output of the outlier detectors and the class decision from the classifiers. When using the machine learned model on new input data, the system and/or method may provide the final inference as output or use the final inference in its final decision-making.
The system 700 may further comprise a processor subsystem 760 which may be configured to, during operation of the system 700, apply the convolutional part of the machine learned model to the input data to generate one or more feature maps, generate a content information-specific feature map by removing location information from one of the one or more feature maps, and apply the content classification part to the content information-specific feature map to obtain a content-based object classification result. The processor subsystem 760 may be further configured to generate a location information-specific feature map by removing content information from one of the one or more feature maps, and apply the location classification part to the location information-specific feature map to obtain a location-based object classification result. The processor subsystem 760 may be further configured to apply the outlier detection part to one or more previously generated feature maps which are generated for the instance of the spatial data, to obtain an outlier detection result, and to classify the object in the spatial data in accordance with the content-based object classification result, the location-based object classification result and the outlier detection result, wherein said classifying comprises classifying the input data in accordance with an object class if the content-based object classification result and the location-based object classification result both indicate the object class and if the outlier detection result does not indicate a presence of an outlier.
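By way of non-limiting illustration, the above decision rule may be sketched as follows; all names are placeholders and the rule may of course be extended, e.g., to trigger another type of action instead of returning an 'unknown' result.

```python
def classify(content_class, location_class, outlier_flags):
    """Return the agreed object class, or None when the input should instead be
    treated as an outlier / unknown object (or trigger another action)."""
    if content_class == location_class and not any(outlier_flags):
        return content_class
    return None

# usage example: CSC and LSC agree on class 2 and no outlier detector fires
assert classify(2, 2, [False, False, False]) == 2
```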
In general, the processor subsystem 760 may be configured to perform any of the functions as previously described with reference to
In some embodiments of the present invention, the system 700 may comprise an actuator interface 740 for providing control data 742 to an actuator 40 in the environment 60. Such control data 742 may be generated by the processor subsystem 760 to control the actuator 40 based on the classification result, as may be generated by the machine learned model when applied to the input data 722. For example, the actuator 40 may be an electric, hydraulic, pneumatic, thermal, magnetic and/or mechanical actuator. Specific yet non-limiting examples include electrical motors, electroactive polymers, hydraulic cylinders, piezoelectric actuators, pneumatic actuators, servomechanisms, solenoids, stepper motors, etc. Such type of control is described with reference to
In other embodiments of the present invention (not shown in
In general, each system described herein, including but not limited to the system 100 of
The method 800 is shown to comprise, in a step titled “ACCESSING MACHINE LEARNED MODEL”, accessing 810 a machine learned model as described elsewhere in this specification and in a step titled “ACCESSING INPUT DATA”, accessing 820 input data, the input data comprising an instance of spatial data, the instance of the spatial data comprising an object to be classified. The method 800 is further shown to comprise, in a step titled “GENERATING FEATURE MAP(S)”, applying the convolutional part of the machine learned model to the input data to generate 830 one or more feature maps, and in a step titled “GENERATING CONTENT INFORMATION-SPECIFIC FEATURE”, generating 840 a content information-specific feature map by removing location information from one of the one or more feature maps. The method 800 is further shown to comprise, in a step titled “GENERATING LOCATION INFORMATION-SPECIFIC FEATURE”, generating 850 a location information-specific feature map by removing content information from one of the one or more feature maps, and in a step titled “GENERATING CONTENT-BASED OBJECT CLASSIFICATION RESULT”, applying 860 the content classification part to the content information-specific feature map to obtain a content-based object classification result. The method 800 is further shown to comprise, in a step titled “GENERATING LOCATION-BASED OBJECT CLASSIFICATION RESULT”, applying 870 the location classification part to the location information-specific feature map to obtain a location-based object classification result, and in a step titled “GENERATING OUTLIER DETECTION RESULT”, applying 880 the outlier detection part to one or more previously generated feature maps which are generated for the instance of the spatial data, to obtain an outlier detection result. The method 800 is further shown to comprise, in a step titled “CLASSIFYING OBJECT IN SPATIAL DATA”, classifying 890 the object in the spatial data in accordance with the content-based object classification result, the location-based object classification result and the outlier detection result, wherein said classifying comprises classifying the input data in accordance with an object class if the content-based object classification result and the location-based object classification result both indicate the object class and if the outlier detection result does not indicate a presence of an outlier.
It will be appreciated that, in general, the operations or steps of the computer-implemented methods 200 and 800 of respectively
Each method, algorithm or pseudo-code described in this specification may be implemented on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. As also illustrated in
Examples, embodiments or optional features, whether indicated as non-limiting or not, are not to be understood as limiting the present invention.
It is noted that a system and method may be provided for classifying objects in spatial data using a machine learned model, as well as a system and method for training the machine learned model. The machine learned model may comprise a content sensitive classifier, a location sensitive classifier and at least one outlier detector. Both classifiers may jointly distinguish between objects in spatial data being in-distribution or marginal-out-of-distribution. The outlier detection part may be trained on inlier examples from the training data, while the presence of actual outliers in the input data of the machine learnable model may be mimicked in the feature space of the machine learnable model during training. The combination of these parts may provide a more robust classification of objects in spatial data with respect to outliers, without having to increase the size of the training data.
It should be noted that the above-mentioned embodiments illustrate rather than limit the present invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the present invention. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or stages other than those stated. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. Expressions such as “at least one of” when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression, “at least one of A, B, and C” should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The present invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are described separately does not indicate that a combination of these measures cannot be used to advantage.