This application claims priority to Indian Patent Applications IN 202311081750, filed Dec. 1, 2023; IN 202311081749, filed Dec. 1, 2023; and IN 202411023983, filed Mar. 26, 2024, the entire contents of each of which are hereby incorporated by reference.
The present invention relates to a method of training a machine learning model to identify image features, such as dents or other surface defects. The invention also relates to a computer system and computer software configured to train a machine learning model.
Object detection is a known method of localizing objects within an image. Modern deep learning algorithms solve this problem by collecting huge numbers of images containing the desired objects. A human marks a bounding box around each object. These boxes are called groundtruth boxes or annotations, and they are the values that an algorithm converges to during the iterative learning process. The person who does the annotation is called an annotator.
A deep learning algorithm then tries to learn this through optimization techniques which converge to the exact values given by the annotator. The convergence to the annotated values happens over several iterations through backpropagation and gradient descent: the error between the model prediction and the ground truth is calculated, and gradient descent adjusts the model parameters to minimize this error.
In the case of detecting a rigid object with a clear outline it may be easy for an annotator to know the boundary of an object. In the case of a feature within an object (such as a dent) this may be more difficult.
Due to the ambiguous nature of dents, and the frames being fed randomly for annotation, annotators may annotate or skip an ambiguous dent depending on their individual preference. Where the dent is annotated, the machine learning model tries to learn it as a “yes”; where it is skipped, the machine learning model tries to learn it as a “no”. The optimizer is therefore left in a contradictory and inconsistent situation, sometimes learning and sometimes not learning such patterns. Sometimes the ambiguous dents are so insignificant that if the machine learning model is forced to learn them, it may end up detecting a lot of false positives, making it overly sensitive.
In the usual case there is a specific class, called the background class, which the algorithm uses to distinguish such false detections from the actual ones and to learn the difference. In the ambiguous case, the background class will be at odds with the dent class and cause inconsistencies.
When frames are annotated, they may be selected randomly from a video, so the annotator generally does not have the temporal sequence as a cue to locate the dent and mark the right bounding box. Since the annotation is a manual process which is already tedious, providing the annotator with the temporal sequence for reference would make the task more difficult without really solving the problem.
The annotation process may also require that the same frame be given to multiple annotators to achieve better consistency in the bounding box. This method was introduced mostly to reduce errors committed due to annotator fatigue, not to resolve the difficulty of deciding an ambiguous boundary. Voting techniques like these will still leave the box undecided, and consecutive frames can still be inconsistent.
Since the optimization process is blind to knowledge of the real world and only converges to the groundtruth boxes provided by the annotator, similar dents being annotated differently leave it in a confused state: it is forced to converge to different values for similar or identical patterns, which creates contradictions.
An aspect of the invention provides a method of training a machine learning model to identify image features, the method comprising: a. providing training image data, the training image data comprising a plurality of pixels; b. assigning a groundtruth annotation to each pixel, wherein each groundtruth annotation relates to a respective one of the pixels and each groundtruth annotation indicates whether or not the pixel corresponds with an image feature; c. providing an ignore mask comprising a set of ignore flags, wherein each ignore flag relates to a respective one of the pixels and each ignore flag provides an indication that the pixel should be ignored; d. for each pixel, receiving a prediction value from the machine learning model, wherein each prediction value provides an indication of a probability of the pixel corresponding with an image feature; e. for each pixel which has no ignore flag, determining a loss value based on the prediction value and groundtruth annotation for that pixel, and training the machine learning model on a basis of the loss value; and f. for each pixel which has an ignore flag, ignoring the prediction value for that pixel so that it is not used to train the machine learning model.
Optionally c. comprises inspecting an object and generating the ignore mask on a basis of the inspection.
Optionally c. comprises receiving inputs from a manual inspection of an object and generating the ignore mask on a basis of the inputs.
Optionally c. comprises inspecting an object with a sensor to generate three-dimensional inspection data and generating the ignore mask on a basis of the three-dimensional inspection data.
Optionally the training image data comprises one or more images of the object.
Optionally the method further comprises generating the training image data by imaging the object.
Optionally the object is imaged with light.
Optionally the training image data comprises a series of images of the object which each contain the same feature viewed from a different viewing angle.
Optionally the method further comprises generating the training image data by imaging the object from a series of different viewing angles.
Optionally b. comprises displaying the training image data to a human annotator; and receiving a groundtruth mask via inputs from the human annotator, the groundtruth mask providing an indication that a region of the training image data contains an image feature.
Optionally d.-f. are repeated, each repeat comprising a respective training epoch.
Optionally the image feature comprises a surface defect.
Optionally the image feature comprises a surface defect of an aircraft.
Optionally the image feature comprises a dent.
Optionally the loss value is determined by the function:
−y_k ln p_k − (1 − y_k) ln(1 − p_k),
wherein y_k is a groundtruth annotation for that pixel and p_k is a prediction value for that pixel; a pixel which corresponds with an image feature has a groundtruth annotation y_k of 1, and a pixel which does not correspond with an image feature has a groundtruth annotation y_k of 0.
Optionally after the machine learning model has been trained, it is used to segment an image in an inference phase.
Optionally c. comprises creating the ignore mask on a basis of the groundtruth mask, wherein the ignore mask comprises a loop at a periphery of the groundtruth mask, the loop having an inner edge and an outer edge.
A further aspect of the invention provides a computer system configured to train a machine learning model by the method of the preceding aspect.
A further aspect of the invention provides computer software configured to train a machine learning model by the method of the preceding aspect.
A further aspect of the invention provides a method of training a machine learning model to identify image features, the method comprising: a. providing training image data, the training image data comprising a plurality of pixels; b. providing a groundtruth mask which provides an indication that a region of the training image data contains an image feature; c. creating one or more ignore masks on a basis of the groundtruth mask, each ignore mask comprising a loop at a periphery of the groundtruth mask, the loop having an inner edge and an outer edge; d. for each pixel, receiving a prediction value from the machine learning model, wherein each prediction value provides an indication of a probability of the pixel corresponding with an image feature; e. for each pixel which coincides with the groundtruth mask and does not coincide with an ignore mask, determining a loss value based on the prediction value for that pixel and training the machine learning model on a basis of the loss value; and f. for each pixel which lies between the inner and outer edges of an ignore mask, ignoring the prediction value for that pixel so that it is not used to train the machine learning model.
Optionally the periphery of the groundtruth mask comprises a margin area extending to an edge; and all or part of an ignore mask is inside the edge so that it overlaps with the margin area of the ground truth mask.
Optionally the periphery of the groundtruth mask comprises a margin area extending to an edge of the groundtruth mask; and all or part of an ignore mask is outside the edge of the groundtruth mask so that it does not overlap with the ground truth mask.
Optionally the periphery of the groundtruth mask comprises a margin area extending to an edge of the groundtruth mask; a first part of an ignore mask is inside the edge of the groundtruth mask so that it overlaps with the margin area; and a second part of the ignore mask is outside the edge of the groundtruth mask so that it does not overlap with the ground truth mask.
Optionally the ignore mask is created on a basis of the groundtruth mask by dilation of a line following the edge of the groundtruth mask.
Optionally each ignore mask is created on a basis of the groundtruth mask by analysing the groundtruth mask by an automated edge detection process to detect an edge of the groundtruth mask; and creating the ignore mask so that it has the same shape as the edge of the groundtruth mask.
Optionally the periphery of the groundtruth mask comprises a margin area extending to an edge of the groundtruth mask; and the inner and outer edges of the ignore mask each have the same shape as the edge of the groundtruth mask.
Optionally for each ignore mask a radial distance between the inner and outer edges of the ignore mask does not vary around the ignore mask.
Optionally the method further comprises assigning a groundtruth annotation to each pixel, wherein each groundtruth annotation relates to a respective one of the pixels and each groundtruth annotation indicates whether or not the pixel corresponds with an image feature; and for each pixel which does not coincide with an ignore mask, a loss value is determined based on the prediction value and groundtruth annotation for that pixel.
Optionally the loss value is determined by the function:
−y_k ln p_k − (1 − y_k) ln(1 − p_k),
wherein y_k is the groundtruth annotation for that pixel and p_k is the prediction value for that pixel; a pixel which corresponds with an image feature has a groundtruth annotation y_k of 1, and a pixel which does not correspond with an image feature has a groundtruth annotation y_k of 0.
Optionally the training image data comprises a series of images of an object which each contain the same feature viewed from a different viewing angle.
Optionally the method further comprises generating the training image data by imaging the object from a series of different viewing angles.
Optionally the object is imaged with light.
Optionally b. comprises displaying the training image data to a human annotator and receiving the groundtruth mask via inputs from the human annotator.
Optionally d.-f. are repeated, each repeat comprising a respective training epoch.
Optionally the image feature comprises a surface defect.
Optionally the image feature comprises a surface defect of an aircraft.
Optionally the image feature comprises a dent.
Optionally after the machine learning model has been trained, it is used to segment an image in an inference phase.
A further aspect of the invention provides a computer system configured to train a machine learning model by the method of the preceding aspect.
A further aspect of the invention provides computer software configured to train a machine learning model by the method of the preceding aspect.
Embodiments of the invention will now be described with reference to the accompanying drawings.
The computer system comprises computer software configured to train the machine learning model 2 by the method described below.
In this example the machine learning model 2 is used to identify features within an object, for instance surface defects (such as dents or scratches) of an aircraft 10.
Before being used to identify surface defects in an inference phase, the machine learning model 2 must be trained in a training phase. The training phase described below broadly involves a pre-training process of providing groundtruth masks and ignore masks; and a training process involving receiving a set of prediction values from the machine learning model 2 and training the machine learning model 2 accordingly.
In a first step a., the training image data 42 is generated by imaging the aircraft 10 with a camera 40 from a series of different viewing angles.
In the example above, the camera 40 senses visible light, but in other examples the camera 40 may be replaced by an infrared camera (with an active source of infrared light), a thermal camera, or an ultrasound camera.
In the example above, the images are acquired from a single aircraft 10. In other examples, the complete set of training image data 42 may be acquired by imaging multiple aircraft, or by imaging multiple parts.
Any visible defect will be present in more than one training image, and each training image may contain more than one defect. Here the term “visible defect” means a defect which is visible in the image and may (or may not) be visible on the aircraft 10.
In the training images 104-112 the image of the dent is clearly visible. In the training images 101-102 and 115-120 the dent is hardly visible—for instance it may be obscured by glare, or less visible due to the viewing angle of the camera or angle of the light. In the training images 103, 113 and 114 the dent is visible but the image of the dent is ambiguous.
Note that only a single dent is shown in each of the images 101-120, but in general multiple dents may be visible in a frame.
In step b., the training image data 42 is displayed to a human annotator, and groundtruth masks are received via inputs from the human annotator. Each groundtruth mask provides an indication that a region of the training image data contains an image feature.
Each groundtruth mask may comprise a set of groundtruth annotations, one for each pixel within the region. Each groundtruth annotation relates to a respective one of the pixels and provides a binary indication or flag (value 1) that the pixel corresponds with an image feature. Alternatively each groundtruth mask may consist only of an indication of the edge(s) of a region containing an image feature, rather than a “pixel-by-pixel” set of groundtruth annotations.
If the annotator only marks the outer edge of the image feature, then optionally the interior of the image feature is “filled-in” with groundtruth annotations so each pixel within the interior of the groundtruth mask is assigned a groundtruth annotation of 1.
Any pixels which do not fall within a groundtruth mask are assigned a groundtruth annotation of 0. Hence each pixel in each image is assigned a groundtruth annotation y_k of either 0 or 1.
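By way of illustration, the following is a minimal sketch, assuming NumPy and OpenCV and using hypothetical names, of how an annotator's polygon might be rasterized and “filled-in” to give a per-pixel array of groundtruth annotations:

```python
import cv2
import numpy as np

def rasterize_groundtruth(image_shape, polygon):
    """Hypothetical sketch: fill an annotator's polygon so that every pixel
    inside the groundtruth mask receives a groundtruth annotation of 1."""
    gt = np.zeros(image_shape[:2], dtype=np.uint8)  # all pixels start at 0
    pts = np.array(polygon, dtype=np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(gt, [pts], 1)  # interior of the mask is "filled-in" with 1s
    return gt
```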
Note that the human annotator is instructed to only generate a groundtruth mask if the feature is clearly visible. Hence no groundtruth masks have been generated for frames 101-103 or 113-120.
There may be thousands of images to annotate, with the frames being presented randomly to different annotators. Hence some frames with ambiguous images may be marked with groundtruth masks while others (like frames 103, 113 and 114) may not.
In step d., each image of the training image data 42 is fed to the machine learning model 2, and a prediction value is received from the machine learning model 2 for each pixel.
Each prediction value output by the machine learning model 2 provides an indication of a probability of a pixel corresponding with an image feature (for example, a dent). Each prediction value is non-binary in the sense that it can take any decimal value p_k between 0 and 1.
If the machine learning model 2 is 100% confident that a pixel should be classified as an image feature, then the prediction value output for that pixel is 1; and if the machine learning model 2 is 100% confident that a pixel should not be classified as an image feature, then the prediction value output for that pixel is 0. In most cases the prediction values will take an intermediate value between 0 and 1. For the training images 103-114, the machine learning model 2 predicts dents as regions with high prediction values. Edges of regions with high prediction values (~1) are indicated in the drawings.
Taking a simple image segmentation case by way of example with only two classes, each pixel may be classified as either a dent or as background. If the prediction value is high (for instance greater than 50%) then the pixel is classified as a dent; and if the prediction value is low (for instance less than 50%) then it is classified as background (or “not-dent”).
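A minimal sketch of this two-class rule, assuming the prediction values are held in a NumPy array (the 50% threshold is the example above):

```python
import numpy as np

def segment_two_class(pred: np.ndarray) -> np.ndarray:
    """Classify each pixel: True = dent, False = background ("not-dent")."""
    return pred > 0.5  # illustrative 50% threshold
```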
The regions with high prediction values in the frames 104-112 all overlap with groundtruth masks 64: region 63 for example. The regions 62 in the frames 103, 113, 114 with high prediction values do not overlap with groundtruth masks 64. This is because the images 103, 113, 114 contain ambiguous images of a dent which are sufficiently visible to be detected by the machine learning model 2, but not sufficiently clear to be annotated with a groundtruth mask.
In step c., one or more ignore masks are created on a basis of the groundtruth masks 64.
In this example the groundtruth mask 64 is created by a painting tool, and has an outer periphery 66 with an outer edge 67. An automated edge detection process analyses the groundtruth mask 64 to detect the outer edge 67 and generates a line 68 of pixels following the outer edge 67. The line 68 is then dilated to create an ignore mask 75 having an inner edge 72 and an outer edge 73.
In all cases the ignore mask comprises a closed loop (i.e., a closed path whose initial point coincides with its terminal point) at the outer periphery 66, 67 of the groundtruth mask 64.
The radial distance between the inner and outer edges 72, 73 of the ignore mask 75 does not vary around the ignore mask 75, and can be chosen by design.
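A minimal sketch of one way such a peripheral ignore mask might be generated, assuming OpenCV morphological operations on a binary groundtruth mask; the constant loop width and the names are illustrative:

```python
import cv2
import numpy as np

def peripheral_ignore_mask(gt_mask: np.ndarray, half_width: int = 3) -> np.ndarray:
    """Hypothetical sketch: detect the outer edge of a binary groundtruth
    mask (values 0/1, dtype uint8) and dilate it into a loop-shaped ignore mask."""
    kernel = np.ones((3, 3), np.uint8)
    # One-pixel-wide line following the outer edge of the groundtruth mask
    # (analogous to line 68 in the description).
    edge = gt_mask - cv2.erode(gt_mask, kernel)
    # Dilating the line grows it both inwards and outwards, giving a closed
    # loop whose inner and outer edges have the same shape as the mask edge.
    return cv2.dilate(edge, kernel, iterations=half_width)
```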
In an alternative embodiment, the radial distance between the inner and outer edges 72, 73 of the ignore mask 75 may vary around the ignore mask 75 in a predetermined way, for instance bigger at the top of the ignore mask than at the bottom.
The ignore mask 75 may comprise a set of ignore flags, each relating to a respective one of the pixels between the inner and outer edges 72, 73 of the ignore mask 75. Each ignore flag provides a binary indication that the pixel should be ignored. Each pixel which coincides with an ignore mask is given an ignore flag.
Alternatively the ignore mask 75 may consist only of an indication of the inner and outer edges 72, 73, rather than a “pixel-by-pixel” set of ignore flags. In this case, optionally the interior of the ignore mask 75 is “filled-in” with ignore flags so each pixel within the interior has an associated ignore flag.
Optionally a pre-processing step is performed so that for all pixels in the inner part 70 of the ignore mask 75 (where the groundtruth annotation and the ignore flag are both 1) the groundtruth annotation is set to 0.
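A short sketch of this optional pre-processing step, assuming the groundtruth annotations and ignore flags are held as binary NumPy arrays as in the sketches above:

```python
import numpy as np

def preprocess(gt_mask: np.ndarray, ignore_mask: np.ndarray) -> np.ndarray:
    """Sketch: reset the groundtruth annotation to 0 wherever an ignore flag is set."""
    gt = gt_mask.copy()
    gt[ignore_mask == 1] = 0  # inner part 70: annotation and flag were both 1
    return gt
```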
In steps e. and f., the training engine 3 processes the prediction values as follows.
For each pixel which has no ignore flag (i.e. it does not coincide with an ignore mask) the training engine 3 determines a loss value based on the prediction value and groundtruth annotation for that pixel. This generates a loss value per pixel, except where the pixel has an ignore flag.
By way of example, the loss value per pixel may be determined by a logistic regression function, such as the function:
−y_k ln p_k − (1 − y_k) ln(1 − p_k),
where y_k is a groundtruth annotation and p_k is a prediction value.
A pixel which corresponds with an image feature has a groundtruth annotation y_k of 1, and a pixel which does not correspond with an image feature has a groundtruth annotation y_k of 0. Hence y_k can be 0 or 1, and p_k can be any decimal value between 0 and 1. The loss value will be large if the prediction value does not match the groundtruth annotation, and small otherwise.
The machine learning model 2 is then trained on a basis of the loss values per pixel. For example, the loss values per pixel may be averaged to determine a mean loss value which is used to train the machine learning model 2.
For each pixel which has an ignore flag, the training engine 3 ignores the prediction value for that pixel so that it does not contribute to the calculation of the mean loss value and hence is not used to train the machine learning model 2.
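By way of illustration, a minimal NumPy sketch of this masked loss calculation, under the assumption that the prediction values, groundtruth annotations and ignore flags are held in arrays of the same shape (names are illustrative):

```python
import numpy as np

def masked_loss(pred: np.ndarray, gt: np.ndarray, ignore: np.ndarray) -> float:
    """Sketch: per-pixel logistic loss, averaged over pixels without an
    ignore flag. pred holds prediction values p_k in (0, 1); gt holds
    groundtruth annotations y_k in {0, 1}; ignore holds ignore flags."""
    eps = 1e-7
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    # -y_k ln p_k - (1 - y_k) ln(1 - p_k), computed for every pixel
    loss = -(gt * np.log(p) + (1 - gt) * np.log(1 - p))
    keep = ignore == 0  # pixels with an ignore flag contribute nothing
    return float(loss[keep].mean())
```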
Hence all of the pixels in the core 65 of the groundtruth mask are used to train the machine learning model 2. For each pixel which lies between the inner and outer edges 72, 73 of the ignore mask 75 (and hence coincides with the ignore mask 75) the prediction value for that pixel is ignored so that it is not used to train the machine learning model 2. Hence pixels at an outer boundary 76 of the region 63 are ignored.
The ignore masks 80 and 81 are variants of the ignore mask 75. All or part of an ignore mask may be inside the outer edge 67 of the groundtruth mask 64, so that it overlaps with the margin area of the groundtruth mask; or outside the outer edge 67, so that it does not overlap with the groundtruth mask.
Table 1 gives eight scenarios that will be encountered at the pixel level, and the corresponding loss value.

Scenario | Groundtruth annotation y_k | Ignore flag | Prediction value p_k | Loss value
---|---|---|---|---
#1 | 0 | no | low | −ln(1 − p_k) (small)
#2 | 0 | no | high | −ln(1 − p_k) (large)
#3 | 0 | yes | low | none (ignored)
#4 | 0 | yes | high | none (ignored)
#5 | 1 | no | low | −ln p_k (large)
#6 | 1 | no | high | −ln p_k (small)
#7 | 1 | yes | low | none (ignored)
#8 | 1 | yes | high | none (ignored)
For each pixel which has a groundtruth annotation of zero and no ignore flag (i.e. scenarios #1 and #2) a loss value is determined based on the prediction value and groundtruth annotation for that pixel, and the machine learning model 2 is trained on a basis of the loss value to learn this pixel as background.
For each pixel which has an ignore flag (i.e. it coincides with the ignore mask) and a groundtruth annotation of zero (i.e. scenarios #3 and #4) the training engine 3 ignores the prediction value for that pixel so that it is not used to train the machine learning model 2.
For each pixel which has a groundtruth annotation of 1 (i.e. it coincides with the groundtruth mask) and no ignore flag (i.e. scenarios #5 and #6) a loss value is determined based on the prediction value and groundtruth annotation for that pixel and the machine learning model 2 is trained on a basis of the loss value to learn the pixel as a dent.
If the pre-processing step mentioned above is not performed, then the training engine 3 handles scenarios #7 and #8 in the same way as scenarios #3 and #4. Hence for each pixel which has an ignore flag and a groundtruth annotation of 1 (i.e. scenarios #7 and #8) the training engine 3 ignores the prediction value for that pixel so that it is not used to train the machine learning model 2.
In the temporal series, the annotation of a dent may be inconsistent between adjacent frames. As a consequence, pixels 65, 66 at the same location at the edge of the dent are inside the groundtruth mask in one image 112, but outside the groundtruth mask in the adjacent image 111 of the temporal series.
This inconsistency could result in the machine learning model 2 being forced to classify the pixel 65 as a dent, and the pixel 66 as background. Such contradicting inputs can result in poor and inconsistent training. The peripheral ignore masks 75, 80, 81 described above may prevent such ambiguous pixels being used to train the machine learning model 2 and this improves the quality and consistency of training.
In the examples above, the groundtruth mask 64 has no holes, so it only has an outer periphery 66 with an outer edge 67.
In other examples, the groundtruth mask may contain a hole so it also has an inner periphery with an inner edge. In such cases, a second (inner) peripheral ignore mask may also be generated, such an ignore mask comprising a loop at the inner periphery of the groundtruth mask, the loop having an inner edge and an outer edge.
The second (inner) peripheral ignore mask may be generated in the same way as the first (outer) peripheral ignore mask. That is, the second (inner) peripheral ignore mask may be created on a basis of the groundtruth mask by analysing the groundtruth mask by an automated edge detection process to detect an inner edge of the groundtruth mask; and creating the ignore mask so that it has the same shape as the inner edge of the groundtruth mask.
In such a case, pixels inside the inner edge of the loop will not coincide with the groundtruth mask (because they are located in the hole in the groundtruth mask). Such a second (inner) peripheral ignore mask can then be used in an identical way to the first (outer) ignore mask. In such a case, for each pixel which coincides with the groundtruth mask and does not coincide with an ignore mask (i.e. does not coincide with either the first or second peripheral ignore mask), a loss value is determined based on the prediction value for that pixel and the machine learning model is trained on a basis of the loss value; and for each pixel which lies between the inner and outer edges of an ignore mask (i.e. it coincides with either the first or second peripheral ignore mask), the prediction value for that pixel is ignored so that it is not used to train the machine learning model.
In the temporal series, the dent is hardly visible in the training images 101-102 and 115-120. For these training images it is undesirable to force the machine learning model 2 to learn the region of the dent as background just because the visibility is low, since these are regions where the dents physically exist.
The term “block” ignore mask is used to refer to an ignore mask with an outer edge but no inner edge, in contrast to the “peripheral” ignore masks described above, which are loops with both inner and outer edges.
Each block ignore mask may comprise a set of ignore flags, each relating to a respective one of the pixels within a rectangular outer edge of the block ignore mask. Each ignore flag provides a binary indication that the pixel should be ignored.
Alternatively each block ignore mask may consist only of an indication of the rectangular outer edge, rather than a “pixel-by-pixel” set of ignore flags. In this case, optionally the interior of the block ignore mask may be “filled-in” with ignore flags so each pixel within the interior has an associated ignore flag.
In a pre-processing stage, all block ignore masks which overlap with a groundtruth mask are removed.
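A sketch of this pre-processing stage, under the assumption that each block ignore mask is stored as rectangle corner coordinates; the names are hypothetical:

```python
import numpy as np

def apply_block_ignore_masks(blocks, gt_mask, shape):
    """Hypothetical sketch: rasterize rectangular block ignore masks,
    discarding any block that overlaps a groundtruth mask."""
    ignore = np.zeros(shape, dtype=np.uint8)
    for (x0, y0, x1, y1) in blocks:  # top-left and bottom-right corners
        if gt_mask[y0:y1, x0:x1].any():
            continue  # overlapping blocks are removed, so scenarios #7/#8 cannot occur
        ignore[y0:y1, x0:x1] = 1  # every pixel in the block receives an ignore flag
    return ignore
```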
The training engine 3 follows the process of Table 1, using the block ignore masks.
For each pixel which has an ignore flag (i.e. scenarios #3 and #4 in which the pixel coincides with a block ignore mask) the training engine 3 ignores the prediction value for that pixel so that it is not used to train the machine learning model 2. All pixels within the block ignore masks in frames 101, 102 and 115-120 correspond with scenario #3 or scenario #4. Scenarios #7 and #8 will not occur due to the pre-processing stage mentioned above.
For each pixel which has no ignore flag (i.e. scenarios #1, #2, #5 and #6) a loss value is determined based on the prediction value and groundtruth annotation for that pixel and the machine learning model 2 is trained on a basis of the loss value. All pixels within the groundtruth masks correspond with scenario #5 or scenario #6.
Note that there are three regions 62 with high prediction values which do not overlap with groundtruth masks 64. In the previous method, which generates peripheral ignore masks 75, 80, 81 on the basis of the groundtruth masks 64, the pixels in the core of these regions 62 correspond with scenario #2 and are used to train the machine learning model 2 as a false positive. In the case of the block ignore masks, these pixels may instead coincide with a block ignore mask and be ignored, so that the machine learning model 2 is not forced to learn them as false positives.
The peripheral ignore masks 75, 80, 81 described above address ambiguity at the boundary of a clearly visible dent, whereas the block ignore masks address dents whose visibility in a frame is low. Optionally these two solutions may be used together in the same training process: i.e. the peripheral ignore masks 75, 80, 81 and the block ignore masks may both be generated and used to train the machine learning model 2.
In the examples above, a single prediction value is determined per pixel by the machine learning model 2. This prediction value provides an indication of a probability of the pixel corresponding with an image feature, such as a dent. In other embodiments of the invention, the machine learning model 2 may output multiple prediction values per pixel, each prediction value associated with a different defect class such as a dent or a scratch. These multiple prediction values can then be used to classify each pixel as a dent and/or a scratch for example.
Where there are more than two defect classes, then the annotator may also have the ability to generate groundtruth masks associated with each defect class. Ignore masks for these multiple defect classes may also be generated and used as described above.
In other embodiments of the invention, the machine learning model 2 may output a single prediction value per pixel, and the single prediction value is used to identify the pixel as background, unclear, or a defect: for instance 0-30%=background class; 30%-70%=intermediate class; 70%-100%=dent class.
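A sketch of this three-way classification, using the illustrative thresholds above:

```python
import numpy as np

def classify_three_way(pred: np.ndarray) -> np.ndarray:
    """Map each prediction value to 0 (background), 1 (intermediate) or 2 (dent)."""
    out = np.zeros(pred.shape, dtype=np.uint8)  # 0-30%: background class
    out[(pred >= 0.3) & (pred < 0.7)] = 1       # 30-70%: intermediate class
    out[pred >= 0.7] = 2                        # 70-100%: dent class
    return out
```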
After the machine learning model 2 has been trained to identify image features by the process above, it can then be used to segment an image in an inference phase. In the inference phase, each pixel is classified on the basis of a prediction value output by the trained machine learning model 2.
Where the word ‘or’ appears this is to be construed to mean ‘and/or’ such that items referred to are not necessarily mutually exclusive and may be used in any appropriate combination.
Although the invention has been described above with reference to one or more preferred embodiments, it will be appreciated that various changes or modifications may be made without departing from the scope of the invention as defined in the appended claims.