The invention relates to a method for object segmentation in images. The invention also relates to a data processing system, a computer program product and a computer readable medium implementing the method.
In modern computer vision, image understanding is generally approached through specific tasks such as object detection and semantic or instance-level segmentation, in other words, object segmentation. In object detection, the locations of objects or object instances (i.e. a specific sample/species of an object within an object category) in the image, e.g. individual cars, pedestrians or traffic signs in case of automotive applications, are predicted as the pixel coordinates of boxes (rectangles) around the objects, usually called bounding boxes. Semantic or instance segmentation tasks, on the other hand, aim at a dense, pixel-level labeling of the whole image, specifying the object category and/or the specific instance for every pixel. In particular, the task of instance segmentation in images is to label each pixel with an identification tag, a number or a code of the instance that the pixel belongs to. As a result, a mask is provided for each object, marking those pixels in the image that are associated with the object. This type of representation gives a more precise description of the location, extent and shape of the objects visible in the scene than the commonly used bounding box (or bounding rectangle) representation is capable of.
A pixel-level segmentation method is disclosed in U.S. Pat. No. 10,067,509 B1 for detecting occluding objects. The method performs pixel-level instance segmentation by predicting for each pixel a) a semantic label of one of the target categories (e.g. car, pedestrian), and b) a binary label indicating whether the pixel is a contour point or not. The individual instance masks can be recovered by separating the pixels of a category with the predicted contours.
The above technical solution is extended in U.S. Pat. No. 10,311,312 B2, wherein two separate classifiers are trained for handling static and dynamic cases separately. The dynamic classifier is used if the tracking of a particular vehicle on multiple video frames is successful, otherwise the static classifier is applied on individual frames. The same pixel-level approach is used for segmentation as in the above document.
Document US 2018/0108137 A1 also discloses an instance-level semantic segmentation system, wherein a rough location of a target object in the image is determined by predicting a bounding box around each object. Then in the second step, a pixel-level instance mask is predicted using the above bounding box of each object instance.
The main disadvantage of pixel-level segmentation methods is their high computational need and the related time consumption. In certain applications of the segmentation task, the speed of recognition is crucial, e.g. in case of self-driving cars. Methods that require too much computational power or are simply too slow for real-time results are not fit for such applications.
An approach to speed up the computation led to the following technical solutions, in which a smaller map (instance map) is created, i.e. one with a lower resolution, and then the map is scaled up to the size of the image.
One example is the publication of K. He et al. “Mask R-CNN” (2017), disclosing a two-step approach for object instance segmentation. Firstly, an object proposal step is applied to roughly localize all the instances of a target category or categories in the image. Then, in a second step, the instance segmentation problem is defined as a pixel-labeling task, where the binary pixels of the segmentation mask of an instance are directly predicted on a fixed-sized (e.g. 14×14 pixels) grid. Here, binary ones in the mask denote the pixel locations of the corresponding object. Then the predicted mask is transformed/rescaled back to the proper location and size of the object. The disadvantage of this solution is that even for such a small grid, a very complex neural network is to be used, having an output dimension of at least 14×14=196. This number of nodes and weighting factors slows down the segmentation; furthermore, the generated small map has to be scaled and interpolated to the size of the full image, which further deteriorates the speed and the efficiency of the method.
A similar method is disclosed in US 2009/0340462 A1, wherein a neural network is used to identify pixels of salient objects in images. First, the resolution of the image is decreased, and the neural network is applied on this reduced image to identify the pixels belonging to the main objects in the image, based on which the main objects' pixels are identified in the original, full resolution image.
The disadvantage of the above technical solutions is that a further step is required to determine the contour or the pixels of the objects in the image, which requires further computational power and time.
Another approach for segmentation is to approximate the contour of an object by a polygon and, instead of the exact contour of the object, the polygon is predicted, preferably by a trained neural network. This approach significantly reduces the computational time and needs compared to the pixel-level segmentation techniques.
In a publication of L. Castrejón et al. “Annotating Object Instances with a Polygon-RNN” (The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5230-5238), the authors propose a solution that represents an instance segmentation mask by a polygon outlining the instance. The vertices of the polygon are reconstructed sequentially one-by-one with a recurrent neural network. An extension of this approach from the same research group is “Polygon-RNN++” (2018). The disadvantage of this solution is that recurrent neural networks have a complex structure, resulting in slower computations.
A further approach is introduced in a publication of N. Benbarka et al. “FourierNet: Compact mask representation for instance segmentation using differentiable shape decoders” (arXiv:2002.02709 [cs.CV], 2020). This publication discloses a single-stage segmentation method in contrast to two-stage segmentation methods. This approach represents the contour of an object by a set of points at which imaginary rays, starting from near the center of mass of the contour, intersect the contour, which is a single-component parametrization of the contour. If more than one intersection exists for a single ray, the intersection farther from the center of mass is selected. A neural network is used to predict the Fourier coefficients (Fourier descriptor) of the set of points representing the contour, from which the contour is reconstructed by inverse Fourier transform. However, the steps used in this method on the one hand limit the complexity of shapes that can be modelled, and on the other hand discard the information present in the neglected contour coordinates. The greatest disadvantage of this method is that the contours of objects having a concave shape can never be correctly predicted and reconstructed; only an envelope of the contour of the object can be approximated. In certain applications, however, there is a need for exact shape or contour reconstruction.
In view of the known approaches, there is a need for a method by the help of which a segmentation of objects in images can be carried out for objects having any contours, including concave shaped contours.
The primary object of the invention is to provide a method for object segmentation in an image, which is free of the disadvantages of prior art approaches to the greatest possible extent.
The object of the invention is to provide a method by the help of which objects in images can be segmented in a more efficient way than the prior art approaches in order to enable segmentation of objects having any shapes or contours.
Accordingly, the object of the invention is to provide a reliable segmentation method that is capable of reconstructing the contour of objects with any shape in images.
The further object of the invention is to provide a data processing system that comprises means for carrying out the steps of the method according to the invention.
Furthermore, the object of the invention is to provide a non-transitory computer program product for implementing the steps of the method according to the invention on one or more computers and a non-transitory computer readable medium comprising instructions for carrying out the steps of the method on one or more computers.
The objects of the invention can be achieved by the method according to claim 1. The objects of the invention can be further achieved by the data processing system according to claim 11, by the non-transitory computer program product according to claim 12, and by the non-transitory computer readable medium according to claim 13. Preferred embodiments of the invention are defined in the dependent claims.
The main advantage of the method according to the invention compared to prior art approaches comes from the fact that it can reconstruct a contour (segmentation contour) of an object having any shape, including complex shapes, even a concave shape. This way a more accurate object segmentation can be achieved than by any method known in the prior art, as the location of the objects can be determined with higher precision.
It has been recognized that using a two-coordinate parametrization of a contour allows for an accurate representation of any closed two-dimensional curve, i.e. complex contours of objects in images, without ambiguities. Segmentation methods are frequently used in decision making processes, e.g. in automotive applications, where the speed of the decision making can be crucial. An obvious choice to speed up the decision making process is to use predetermined, simple shapes that can be easily and quickly recognized even from a few characteristic points. Contrary to this approach, the method according to the invention is adapted to recognize arbitrary, complex shapes. It has been recognized that although the determination of arbitrary, complex shapes may increase the computational needs of the method, it also increases the precision of the decision making process based on the detected contours, which is desired in various safety-critical applications such as applications related to self-driving vehicles or medical applications. Moreover, the parameterization of the segmentation contour according to the invention provides flexibility and control to balance between the accuracy and computational efficiency of the method.
It has also been recognized that instead of a simple two-coordinate representation of the contour, a transformed (e.g. Fourier transformed) representation is to be used in order to decrease the computational needs for estimating the representation of the contour by a machine learning system implementing any known machine learning algorithm or method, e.g. comprising a neural network, e.g. a convolutional neural network (CNN), which provides an efficient estimation of the representation of the contour. By using the transformed representation having a fixed length resulting in a compact representation of the contour, the complexity of the trained machine learning system can be reduced as compared to the current technology involving pixel-level instance description, which results in a higher processing speed and a smaller memory footprint. It is also advantageous that the contour can be easily reconstructed from the compact representation.
Another advantage is that due to the smaller computational needs, the method according to the invention can reconstruct the contours of the objects with a higher precision compared to the prior art solutions if using the same computational power.
The method according to the invention is capable of segmenting multiple objects in the image, including objects that are occluded or partially hidden. An occluded or partially hidden object is an object that is not visible in the image in its entirety, e.g. because at least a part of it is hidden behind another object. In this case the visible parts of the object can be segmented and, depending on the specific embodiment of the method, the occluded parts of the object may be ignored or be assigned to the visible parts of the same object.
The method according to the invention is capable of reconstructing the contour of the object by estimating a typical appearance (a basic representation or a reference contour) of the shape of the object and also by estimating at least one geometric parameter of a geometric transformation such as scaling, rotation, mirroring, or translation of the object, or a combination thereof, wherein the geometric parameter or geometric parameters correspond to the size, position and orientation of the object in the image. Separating the basic shape of the object and the above-mentioned geometric transformations provides a representation of object contours that can be estimated in a more efficient manner, wherein the basic shape or reference contour is invariant to the above geometric transformations. Certain machine learning algorithms/methods, e.g. convolutional neural networks, are invariant to translations, which aligns well with such a disjoint representation of the object contour. By the application of this disjoint representation, the same reference contour can be estimated for the same object located at different parts of the image, regardless of its size, position and orientation. The information regarding the exact size, position and orientation can be encoded in a few geometric parameters. Furthermore, in real applications, the geometric transformations well approximate rigid-body transformations in 3D space, i.e. the movement of an object as projected onto the image. Therefore, in case several images are processed in a sequence, e.g. images of a camera stream, wherein the consecutive images are similar to each other, the overall shape of the object in the images is almost identical, but its size, position or orientation can be slightly different. The approach of determining the shape and the corresponding geometric parameters further reduces the computational needs of the method and allows for a faster segmentation of the objects in the images.
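The disjoint representation described above can be illustrated by a minimal sketch, in which a sampled contour is split into a translation (its centroid), a scale (its root-mean-square radius) and a normalized reference contour that is invariant to both. The function name and the choice of centroid/RMS normalization are illustrative assumptions, not a definitive implementation of the claimed method.

```python
import math

def decompose_contour(points):
    # Illustrative sketch (assumed helper, not the claimed method):
    # split a sampled contour into geometric parameters and a
    # translation- and scale-invariant reference contour.
    n = len(points)
    cx = sum(p[0] for p in points) / n          # translation parameter
    cy = sum(p[1] for p in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    # scale parameter: root-mean-square distance from the centroid
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n)
    reference = [(x / scale, y / scale) for x, y in centered]
    return (cx, cy), scale, reference
```

With such a decomposition, the same reference contour is obtained for an object regardless of where it appears in the image and how large it is, so only a few geometric parameters change between similar consecutive frames.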
Such a representation is easier to be learned by machine learning methods, including but not limited to convolutional neural networks.
The method according to the invention therefore can be used in any vision-based scene understanding system, including medical applications (medical image processing) or improving the vision of self-driving vehicles.
Preferred embodiments of the invention are described below by way of example with reference to the following drawings, where
The invention relates to a method for segmentation of objects or object instances in images, together called object segmentation. The object instances are preferably limited to an application-specific set of categories of interest, e.g. cars, pedestrians etc. in an automotive application or various organs in case of a medical application. Throughout the description, the word “object” can indicate different object instances from the same category, or objects from different categories. Moreover, the term “object segmentation” is used for the task of instance segmentation, i.e. to label the pixels of an image with an identification tag of the corresponding object instance the pixels belong to. In applications where only one object is present in the image, object segmentation simplifies to semantic segmentation, i.e. labeling each pixel with its category.
In case of object segmentation, the usual task is to predict a label (an identification tag, e.g. a number, a code or a tag) for each pixel corresponding to a particular object in the image, resulting in a pixel-wise object mask. In the method according to the invention, the objects to be segmented are represented by their contour (segmentation contour) in the image, based on which a mask for the object can be created, i.e. by including the pixels within the segmentation contour with or without the segmentation contour itself.
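The step of creating a pixel-wise mask from a segmentation contour can be sketched as follows, assuming the contour is given as a closed polygon of sampled points; the even-odd point-in-polygon test and the function name are illustrative assumptions rather than the method prescribed by the invention.

```python
def contour_to_mask(contour, width, height):
    # Illustrative sketch: rasterize a closed polygonal segmentation
    # contour into a binary mask, marking pixels whose centers fall
    # inside the contour (even-odd rule).
    def inside(px, py):
        hit = False
        n = len(contour)
        for i in range(n):
            x1, y1 = contour[i]
            x2, y2 = contour[(i + 1) % n]
            # count crossings of a horizontal ray from (px, py) to the left
            if (y1 > py) != (y2 > py):
                if px < x1 + (py - y1) * (x2 - x1) / (y2 - y1):
                    hit = not hit
        return hit
    return [[1 if inside(x + 0.5, y + 0.5) else 0 for x in range(width)]
            for y in range(height)]
```

Whether the pixels lying exactly on the contour are included in the mask is a design choice, as noted above.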
According to the invention, instead of determining the real-space coordinates of the segmentation contour points directly, a representation, preferably a compact representation, is generated from the points of the segmentation contour. This representation of the segmentation contour (usually called a descriptor of the contour or a descriptor) can be learned by a machine learning system. The machine learning system preferably implements any known machine learning algorithm or method, e.g. the machine learning system comprises a neural network, preferably a convolutional neural network. A trained machine learning system can determine the descriptor, from which the segmentation contour can be reconstructed, preferably by an inverse transform. Embodiments of the method according to the invention shown in the figures are implemented by applying neural networks as a machine learning algorithm due to their high efficiency in segmentation tasks compared to other machine learning algorithms/methods known in the art. However, other machine learning algorithms/methods can also be used, for example methods for filtering or feature extraction (e.g. scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG), Haar-filter or Gabor-filter), regression methods (e.g. support vector regression (SVR) or decision tree), ensemble methods (e.g. random forest, boosting), feature selection (e.g. minimum redundancy and maximum relevance (MRMR)), dimension reduction (e.g. principal component analysis (PCA)) or any suitable combinations thereof. The machine learning algorithm/method has to be trained to match an image and a representation (descriptor) of a contour of an object from which the segmentation contour can be reconstructed.
The method according to the invention for object segmentation in an image, comprises the steps of
According to the invention, the segmentation contour of the object is a closed two-dimensional parametric curve, points (contour points) of which are defined by two coordinate components, wherein both coordinate components are parametrized. The use of a discrete number of contour points can limit the complexity of the method and reduce the computational needs.
Preferably, the two coordinate components of the segmentation contour are independently parametrized, e.g. by a time-like parameter, preferably by a single time-like parameter. The parametrized coordinate components within the 2D plane may be expressed in any coordinate system and reference frame, using e.g. a Cartesian, a polar or a complex (or any alternative) coordinate representation. The advantage of parametrizing both coordinate components of the two-dimensional curve is that curves having any shape (including concave shapes) can be represented. In a preferred embodiment of the method according to the invention, the segmentation contour is represented by Cartesian coordinates, even more preferably the segmentation contour is represented by Cartesian coordinates parametrized with a time-like parameter t encoding the trajectory r of the curve, i.e. r(t)=(x(t), y(t)), wherein x and y are functions defining respective Cartesian coordinates of contour points of the segmentation contour. In another preferred embodiment the parametrization of the segmentation contour is encoded via its tangent vector, i.e. the velocity along the trajectory, which can be extracted as displacement vectors of the contour points. In a further preferred embodiment, the segmentation contour is parametrized as a sequence of standardized line segments linking together the points of the segmentation contour.
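The tangent-vector parametrization mentioned above can be sketched in a few lines: for a closed curve sampled at discrete contour points, the velocity along the trajectory is approximated by the displacement vectors between consecutive points. The function name is an assumption made for illustration.

```python
def displacement_vectors(contour):
    # Tangent-vector (velocity) parametrization sketch: displacement
    # between consecutive points of a closed contour; the last vector
    # closes the curve back to the first point.
    n = len(contour)
    return [(contour[(i + 1) % n][0] - contour[i][0],
             contour[(i + 1) % n][1] - contour[i][1])
            for i in range(n)]
```

Because the contour is closed, the displacement vectors sum to zero, which is the property that makes this a valid encoding of the trajectory r(t) up to its starting point.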
Instead of directly estimating the contour points of the segmentation contour, the method according to the invention estimates, by the trained machine learning system, a representation, preferably a transformed, compact representation of the contour. The accuracy of the method, i.e. the closeness of the segmentation contour to the exact contour of the object, can be controlled by the dimensions of the transformed representation, e.g. also considering the available computational resources. The transformed representation also allows for a disjoint representation of the segmentation contour comprising a generic shape of the object (e.g. a reference contour) and a geometrical transformation imposed on the shape. In a preferred embodiment of the invention, the compact representation can be generated by Fourier transform, even more preferably by discrete Fourier transform.
Accordingly, in a preferred embodiment of the invention, the sequence of the above displacement vectors is transformed from the spatial domain into the frequency domain, preferably by Fourier transform, even more preferably by discrete Fourier transform. As a result, the segmentation contour is represented by amplitudes of Fourier harmonics. This particular representation is commonly referred to as an elliptic Fourier descriptor (EFD) of a curve in the literature (F. P. Kuhl and C. R. Giardina, “Elliptic Fourier features of a closed contour”, Computer Graphics and Image Processing, 1982). The advantage of the discrete Fourier transform is that it may be performed on any two-component parametrization of the curve. In order to obtain a compact representation of the segmentation contour, the number of coefficients of the descriptor is limited to a fixed value. This value can be an input parameter for the machine learning algorithm when estimating the representation (descriptor) of the segmentation contour, and it controls the accuracy (precision) of the reconstructed segmentation contour. By representing the segmentation contour of an object by a single vector of coefficients, a compact representation of fixed length is provided. The length of this vector is proportional to the number of harmonics used, e.g. in case of Fourier transform the number of Fourier harmonics indicating the order of the transform. Hereinafter this fixed-length vector is referred to as the Fourier descriptor.
For a single frequency, two real-valued Fourier coefficients account for the amplitude and phase of the given harmonic, respectively. Altogether, four real-valued coefficients are required to represent a single frequency component of the two-component trajectory along the real-space contour in two dimensions. As a result, if the segmentation contour is represented by an elliptic Fourier descriptor, the length of the descriptor is 4×O, where O denotes the number of harmonics (also referred to as order in the literature) of the transform. This way the method according to the invention simplifies the task of object segmentation to a regression of a fixed-length vector containing the descriptor of the segmentation contour. This task can be learned from an existing set of training data containing image and segmentation contour (or object mask) pairs, from which the above vector representation can be derived. The regression can be implemented in any form including machine learning methods/algorithms, for example by convolutional neural networks. The segmentation contour can be reconstructed from the descriptor by applying an inverse of the transform, i.e. in case of elliptic Fourier descriptors the inverse discrete Fourier transform can be used.
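The inverse step can likewise be sketched in a self-contained form, again assuming the simplified complex-DFT convention of four real values (+k and −k coefficients) per harmonic; the function name and the separate `center` argument (the DC term of the transform) are assumptions for illustration.

```python
import cmath

def reconstruct_contour(descriptor, center, num_points):
    # Inverse-transform sketch: rebuild contour samples from a
    # truncated 4*O-length descriptor (real/imag parts of the +k and
    # -k coefficients per harmonic) plus the contour center (DC term).
    order = len(descriptor) // 4
    points = []
    for m in range(num_points):
        t = 2 * cmath.pi * m / num_points   # time-like parameter
        z = complex(center[0], center[1])
        for k in range(1, order + 1):
            a, b, c, d = descriptor[4 * (k - 1):4 * k]
            z += complex(a, b) * cmath.exp(1j * k * t)
            z += complex(c, d) * cmath.exp(-1j * k * t)
        points.append((z.real, z.imag))
    return points
```

Note that `num_points` is independent of the descriptor length, so the contour can be resampled at any density from the same fixed-length vector.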
It is emphasized that any suitable representations of coefficients such as Cartesian coordinates, polar coordinates or complex vectors are equivalent for the proposed method.
In the embodiment illustrated in
In a further preferred embodiment of the method according to the invention (not illustrated, the reference signs refer to the ones in
An implementation of the method according to
A detailed comparison of the reconstructed segmentation contours determined by manual annotations, by the method according to
The third row of
The fourth row of
As it can be seen in
In case of an occlusion, it is preferable to denote parts of the same object with the same identification tag during segmentation. According to a preferred embodiment of the method according to the invention, an ordering parameter, representing e.g. a depth or a layer, can be determined for occluding objects. Based on the ordering parameter, e.g. having an ordering parameter with a same or a similar value, segmented contours belonging to the same occluded object can be identified and the same identification tag can be assigned to segmentation contours belonging to the same object.
In a further preferred embodiment, for handling occlusions, a visibility score value is generated by the machine learning algorithm, preferably for the estimated representation of each segmentation contour. The visibility score value preferably indicates visibility or non-visibility of each object part resulting from breaking up the object into parts by the occlusion. Based on the visibility score value, non-visible object parts can be ignored or omitted, e.g. can be excluded from a segmented image, or alternatively, the non-visible object parts can be assigned to the visible parts of the same object, i.e. by assigning the same identification tag. The same identification tags are preferably assigned based on an ordering parameter as described above.
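The post-processing described in this and the preceding paragraph can be sketched as follows; the dictionary keys, threshold values and the grouping-by-depth heuristic are all assumptions made for illustration, not features mandated by the claims.

```python
def merge_occluded_parts(parts, vis_threshold=0.5, depth_tol=0.25):
    # Hypothetical post-processing sketch: drop object parts whose
    # visibility score is below a threshold, then assign the same
    # identification tag to parts whose ordering parameter (here a
    # depth value) is close, treating them as one occluded object.
    visible = [p for p in parts if p["visibility"] >= vis_threshold]
    visible.sort(key=lambda p: p["depth"])
    tag = 0
    for i, part in enumerate(visible):
        if i > 0 and abs(part["depth"] - visible[i - 1]["depth"]) > depth_tol:
            tag += 1
        part["tag"] = tag
    return visible
```

Alternatively, instead of dropping low-visibility parts, they could be kept and tagged together with the visible parts of the same object, matching the second option described above.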
According to the embodiment shown in
The visibility score value of visible object parts in this example is 1, however, other non-zero values can be used to indicate further parameters or features of the visible objects or object parts. In certain embodiments of the method according to the invention, the visibility score value can comprise a value of an ordering parameter, e.g. corresponding to a distance from the camera taking the image 10. Based on the visibility score value and/or the ordering parameter, a relation, preferably a spatial relation of the segmentation contours can be determined, and segmentation contours belonging to the same object can be identified.
In the example according to
The invention further relates to a data processing system comprising means for carrying out the steps of the method according to the invention. The data processing system is preferably implemented on one or more computers, and it is trained for object segmentation, e.g. for providing an estimation of a representation of a segmentation contour of an object. The input of the data processing system is an image to be segmented, the image including one or more objects or object parts. The segmentation contour of the object is represented as a closed two-dimensional parametric curve, each point of which is defined by two coordinate components, wherein both coordinate components are parametrized. Characteristic features of the representation of the segmentation contour have been discussed in more detail in connection with
Preferably, the machine learning system of the data processing system is further trained to provide an estimation of at least one parameter of a geometric transformation and/or an identification tag for each object, wherein the geometric transformation comprises scaling, translation, rotation and/or mirroring, and the identification tag is preferably a unique identifier of each object.
In a preferred embodiment, the same identification tag is assigned to parts of the same object. In a further preferred embodiment, the machine learning system of the data processing system is trained to segment multiple objects in an image, and/or objects breaking up into parts due to occlusion. A preferred data processing system comprises a machine learning system that is trained to determine a visibility score value for each object or object part relating to the visibility of the respective object or object part. For handling occlusions, the visibility score value may comprise a value of an ordering parameter representing the relative position of the occluding object, based on which the same identification tag can be assigned to object parts belonging to the same object.
The machine learning system of the data processing system preferably includes a neural network, more preferably a convolutional neural network, trained for object segmentation.
The invention, furthermore, relates to a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out an embodiment of the method according to the invention.
The computer program product may be executable by one or more computers.
The invention also relates to a computer readable medium comprising instructions which, when executed by a computer, cause the computer to carry out an embodiment of the method according to the invention.
The computer readable medium may be a single one or comprise more separate pieces.
The invention is, of course, not limited to the preferred embodiments described in detail above, but further variants, modifications and developments are possible within the scope of protection determined by the claims. Furthermore, all embodiments that can be defined by any arbitrary dependent claim combination belong to the invention.
Number | Date | Country | Kind |
---|---|---|---|
P2000238 | Jul 2020 | HU | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/HU2020/050059 | 12/16/2020 | WO |