Having annotated data is crucial to the training of machine-learning (ML) models or artificial neural networks. Current data annotation relies heavily on manual work, and even when computer-based tools are provided, they still require a tremendous amount of human effort (e.g., mouse clicking, drag-and-drop, etc.). This strains resources and often leads to inadequate and/or inaccurate results. Accordingly, it is highly desirable to develop systems and methods to automate the data annotation process such that more data may be obtained for ML training and/or verification.
Described herein are systems, methods, and instrumentalities associated with automatic image annotation. An apparatus capable of performing the image annotation task may include one or more processors that are configured to obtain a first image of an object and a first annotation of the object, and determine, using a machine-learned (ML) model (e.g., implemented via an artificial neural network) and the first annotation, a first plurality of features (e.g., a first feature vector) from the first image. The first annotation may be generated with human intervention (e.g., at least partially) and may identify the object in the first image, for example, through an annotation mask. The one or more processors of the apparatus may be further configured to obtain a second, un-annotated image of the object and determine, using the ML model, a second plurality of features (e.g., a second feature vector) from the second image. Using the first plurality of features extracted from the first image and the second plurality of features extracted from the second image, the one or more processors of the apparatus may be configured to generate, automatically (e.g., without human intervention), a second annotation of the object that may identify the object in the second image.
In examples, the one or more processors of the apparatus described above may be further configured to provide a user interface for generating the first annotation. In examples, the one or more processors of the apparatus may be configured to determine the first plurality of features from the first image by applying respective weights to the pixels of the first image in accordance with the first annotation. The weighted imagery data thus obtained may then be processed based on the ML model to extract the first plurality of features. In examples, the one or more processors of the apparatus may be configured to determine the first plurality of features from the first image by extracting preliminary features from the first image using the ML model and then applying respective weights to the preliminary features in accordance with the first annotation to obtain the first plurality of features.
In examples, the one or more processors of the apparatus described herein may be configured to generate the second annotation by determining one or more informative features based on the first plurality of features extracted from the first image and the second plurality of features extracted from the second image, and generating the second annotation based on the one or more informative features. For instance, the one or more processors may be configured to generate the second annotation of the object by aggregating the one or more informative features (e.g., a set of features common to both the first and the second plurality of features) into a numeric value and generating the second annotation based on the numeric value. In examples, this may be accomplished by backpropagating a gradient of the numeric value through the ML model and generating the second annotation based on respective gradient values associated with one or more pixel locations of the second image.
The first and second images described herein may be obtained from various sources including, for example, from a sensor that is configured to capture the images. Such a sensor may include a red-green-blue (RGB) sensor, a depth sensor, a thermal sensor, etc. In other examples, the first and second images may be obtained using a medical imaging modality such as a computed tomography (CT) scanner, a magnetic resonance imaging (MRI) scanner, an X-ray scanner, etc., and the object of interest may be an anatomical structure such as a human organ, a human tissue, a tumor, etc. While embodiments of the present disclosure may be described using medical images as examples, those skilled in the art will appreciate that the disclosed techniques may also be used to process other types of data.
A more detailed understanding of the examples disclosed herein may be had from the following description, given by way of example in conjunction with the accompanying drawing.
The present disclosure is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
Image 102 may be annotated for various purposes. For example, the image may be annotated such that the object of interest in the image may be delineated (e.g., labeled or marked up) from the rest of the image and used as ground truth for training a machine learning (ML) model (e.g., an artificial neural network) for image segmentation. The annotation may be performed through annotation operations 104, which may involve human effort or intervention. For instance, annotation operations 104 may be performed via a computer-generated user interface (UI), and by displaying image 102 on the UI and requiring a user to outline the object in the image using an input device such as a computer mouse, a keyboard, a stylus, a touch screen, etc. The user interface and/or input device may, for example, allow the user to create a bounding box around the object of interest in image 102 through one or more of the following actions: clicks, taps, drags-and-drops, clicks-drags-and-releases, scratches, drawing motions, etc. These annotation operations may result in a first annotation 106 of the object of interest being created (e.g., generated). The annotation may be created in various forms including, for example, an annotation mask that may include respective values (e.g., Booleans or decimals having values between 0 and 1) for the pixels of image 102 that may indicate whether (e.g., based on a likelihood or probability) each of the pixels belongs to the object of interest or an area outside of the object of interest (e.g., a background area).
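By way of illustration only, the relationship between a user-drawn bounding box and an annotation mask may be sketched as follows. The function name `bounding_box_to_mask` and the (top, left, bottom, right) box convention are assumptions made for this sketch and are not prescribed by the disclosure; an actual annotation mask may instead carry decimal likelihood values and follow the object contour.

```python
def bounding_box_to_mask(height, width, box):
    """Return a height x width mask with 1.0 inside the box and 0.0 outside.

    `box` is (top, left, bottom, right) in pixel indices, with `bottom` and
    `right` exclusive. This convention is an assumption of the sketch.
    """
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0
         for c in range(width)]
        for r in range(height)
    ]


# Example: a 3 x 4 image with a box covering rows 1-2 and columns 1-2.
mask = bounding_box_to_mask(3, 4, (1, 1, 3, 3))
for row in mask:
    print(row)
```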
The annotation (e.g., first annotation 106) created through operations 104 may be used to annotate (e.g., automatically) one or more other images of the object of interest. Image 108 of
First annotation 206 may be used to enhance the completeness and/or accuracy of the first plurality of features f1 (e.g., which may be obtained as a feature vector or feature map). For example, using a normalized version of annotation 206 (e.g., by converting probability values in the annotation mask to a value range between 0 and 1), first image 202 (e.g., pixel values of the first image 202) may be weighted (e.g., before the weighted imagery data is passed to the ML feature extraction neural network 208) such that pixels belonging to the object of interest may be given larger weights during the feature extraction process. As another example, the normalized annotation mask may be used to apply (e.g., inside the feature extraction neural network) respective weights to the features (e.g., preliminary features) extracted by the feature extraction neural network at 208 such that features associated with the object of interest may be given larger weights in the first plurality of features f1 produced by the feature extraction neural network.
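The pixel-weighting example above may be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `normalize_mask` and `weight_image` are hypothetical, min-max rescaling is only one possible way to normalize the mask, and the handling of a constant mask (all weights set to 1) is a choice made for the sketch rather than something the disclosure specifies.

```python
def normalize_mask(mask):
    """Linearly rescale mask values to the range [0, 1] (min-max; an assumption)."""
    flat = [v for row in mask for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant mask: weight every pixel equally (a choice made for this sketch).
        return [[1.0 for _ in row] for row in mask]
    return [[(v - lo) / (hi - lo) for v in row] for row in mask]


def weight_image(image, mask):
    """Multiply each pixel by its normalized mask value before feature extraction."""
    weights = normalize_mask(mask)
    return [
        [p * w for p, w in zip(img_row, w_row)]
        for img_row, w_row in zip(image, weights)
    ]


image = [[10, 20], [30, 40]]
mask = [[0, 2], [4, 2]]            # un-normalized annotation values
print(weight_image(image, mask))   # pixels with larger mask values keep more intensity
```

The same multiplication may be applied to preliminary feature maps instead of raw pixels, provided the mask is first resized to the spatial size of those feature maps.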
Referring back to
In examples, the second plurality of features f2 extracted from second image 204 and/or the informative features f3 may be further processed at 212 to gather information (e.g., from certain dimensions of f2) that may be used to automatically annotate the object of interest in second image 204. For example, based on informative features f3, an indicator vector having the same size as feature vectors f1 and/or f2 may be derived in which elements that correspond to informative features f3 may be given a value of 1 and the remaining elements may be given a value of 0. A score may then be calculated to aggregate the informative features f3 and/or the informative elements of feature vector f2. Such a score may be calculated, for example, by conducting an element-wise multiplication of the indicator vector and feature vector f2. Using this calculated score, annotation 214 (e.g., a second annotation) of the object of interest may be automatically generated for second image 204, for example, by backpropagating a gradient of the score through the feature extraction neural network (e.g., the network used at 210) and determining pixel locations (e.g., spatial dimensions) that may correspond to the object of interest based on respective gradient values associated with the pixel locations. For instance, pixel locations having positive gradient values during the backpropagation (e.g., these pixel locations may make positive contributions to the desired results) may be determined to be associated with the object of interest and pixel locations having negative gradient values during the backpropagation (e.g., these pixel locations may not make contributions or may make negative contributions to the desired results) may be determined to be not associated with the object of interest.
Annotation 214 of the object of interest may then be generated for the second image based on these determinations, for example, as a mask determined based on a weighted linear combination of the feature maps obtained using the feature extraction network (e.g., the gradients may operate as the weights in the linear combination).
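The indicator vector, score, and gradient-based annotation described above may be illustrated with the following sketch. Several simplifications are assumptions of the sketch and not part of the disclosure: the feature extraction network is replaced by a toy linear map (a hypothetical weight matrix) so that the gradient of the score with respect to each pixel can be written in closed form, the indices of the informative features are supplied directly, the element-wise products are summed to obtain a single numeric score, and pixels with a zero gradient are treated as background.

```python
def indicator_vector(size, informative_idx):
    """1.0 at positions of informative features, 0.0 elsewhere."""
    idx = set(informative_idx)
    return [1.0 if i in idx else 0.0 for i in range(size)]


def extract_features(weights, pixels):
    """Toy linear stand-in for the feature extraction network: f = W x."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def score(indicator, features):
    """Element-wise product of indicator and features, summed to one number."""
    return sum(i * f for i, f in zip(indicator, features))


def pixel_gradients(weights, indicator):
    """Gradient of the score w.r.t. each pixel; for f = W x this is W^T indicator."""
    n_pixels = len(weights[0])
    return [
        sum(ind * row[j] for ind, row in zip(indicator, weights))
        for j in range(n_pixels)
    ]


def annotation_from_gradients(grads):
    """Mark pixels with positive gradient as object (1), all others as background (0)."""
    return [1 if g > 0 else 0 for g in grads]


# Hypothetical 3-feature extractor over a flattened 4-pixel image.
W = [[1.0, -1.0, 0.5, 0.0],
     [0.0, 2.0, -1.0, 1.0],
     [-1.0, 0.0, 1.0, 1.0]]
pixels = [0.2, 0.8, 0.5, 0.1]

f2 = extract_features(W, pixels)
ind = indicator_vector(len(f2), informative_idx=[0, 2])
print(score(ind, f2))
print(annotation_from_gradients(pixel_gradients(W, ind)))
```

With an actual neural network the gradient would be obtained through automatic differentiation rather than a closed-form expression, and the resulting per-pixel values may be combined with the feature maps as described above.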
The annotation (e.g., annotation 214) generated using the techniques described herein may be presented to a user, for example, through a user interface (e.g., the UI described above) so that further adjustments may be made by the user to refine the annotation. For example, the user interface may allow the user to adjust the contour of annotation 214 by executing one or more of the following actions: clicks, taps, drags-and-drops, clicks-drags-and-releases, scratches, drawing motions, etc. Adjustable control points may be provided along the annotation contour and the user may be able to change the shape of the annotation by manipulating one or more of these control points (e.g., by dragging and dropping the control points to various new locations on the display screen).
The first and/or second annotation described herein may be refined by a user, and a user interface (e.g., a computer-generated user interface) may be provided for accomplishing the refinement. In addition, it should be noted that the automatic annotation techniques disclosed herein may be based on and/or further improved by more than one previously generated annotated image (e.g., which may be manually or automatically generated). For example, when multiple annotated images are available, an automatic annotation system or apparatus as described herein may continuously update the information that may be extracted from these annotations and use the information to improve the accuracy of the automatic annotation.
At 408, the extracted features may be compared to determine a loss, e.g., using one or more suitable loss functions (e.g., mean squared errors, L1/L2 losses, adversarial losses, etc.). The determined loss may be evaluated at 410 to determine whether one or more training termination criteria have been satisfied. For instance, a training termination criterion may be deemed satisfied if the loss(es) described above is below (or above) a predetermined threshold, if a change in the loss(es) between two training iterations (e.g., between consecutive training iterations) falls below a predetermined threshold, etc. If the determination at 410 is that the training termination criterion has been satisfied, the training may end. Otherwise, the loss may be backpropagated (e.g., based on a gradient descent associated with the loss) through the neural network at 412 before the training returns to 406.
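The loop formed by operations 408, 410, and 412 may be sketched as follows. The sketch is illustrative only: a single scalar parameter fitted to a target with a squared-error loss stands in for the neural network, and the function names, threshold values, and learning rate are hypothetical choices rather than values taken from the disclosure.

```python
def termination_met(loss, prev_loss, loss_threshold=1e-3, delta_threshold=1e-6):
    """Return True if either example termination criterion is satisfied."""
    if loss < loss_threshold:
        return True                      # loss is below the threshold
    if prev_loss is not None and abs(prev_loss - loss) < delta_threshold:
        return True                      # loss has stopped changing
    return False


def train(param, target, lr=0.1, max_iters=1000):
    """Toy gradient-descent loop mirroring operations 408, 410, and 412."""
    prev_loss = None
    for step in range(max_iters):
        loss = (param - target) ** 2             # 408: determine the loss
        if termination_met(loss, prev_loss):     # 410: check termination criteria
            return param, loss, step
        grad = 2.0 * (param - target)            # 412: gradient of the loss
        param -= lr * grad                       #      gradient-descent update
        prev_loss = loss
    return param, (param - target) ** 2, max_iters


param, loss, steps = train(param=0.0, target=3.0)
print(param, loss, steps)
```

The iteration cap `max_iters` is an additional safeguard added for the sketch so that the loop always ends even if neither criterion is met.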
The pair of training images provided to the neural network may belong to the same category (e.g., both images may be brain MRI images containing a tumor) or the pair of images may belong to different categories (e.g., one image may be a normal MRI brain image and the other image may be an MRI brain image containing a tumor). As such, the loss function used to train the neural network may be selected such that feature differences between a pair of images belonging to the same category may be minimized and feature differences between a pair of images belonging to different categories may be maximized.
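One loss with this property is the widely used margin-based contrastive loss, sketched below. The disclosure does not name a specific loss function, so this choice, the function name `contrastive_loss`, and the `margin` parameter are assumptions made for illustration.

```python
import math


def contrastive_loss(f_a, f_b, same_category, margin=1.0):
    """Margin-based contrastive loss on a pair of feature vectors.

    Same-category pairs are penalized by their squared distance, pulling
    their features together. Different-category pairs are penalized only
    while they are closer than `margin`, pushing their features apart.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_a, f_b)))
    if same_category:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Same category: a large distance yields a large loss.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_category=True))
# Different categories, already farther apart than the margin: no penalty.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_category=False))
# Different categories but too close: penalized.
print(contrastive_loss([0.0, 0.0], [0.0, 0.5], same_category=False))
```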
For simplicity of explanation, the training steps are depicted and described herein with a specific order. It should be appreciated, however, that the training operations may occur in various orders, concurrently, and/or with other operations not presented or described herein. Furthermore, it should be noted that not all operations that may be included in the training process are depicted and described herein, and not all illustrated operations are required to be performed.
The systems, methods, and/or instrumentalities described herein may be implemented using one or more processors, one or more storage devices, and/or other suitable accessory devices such as display devices, communication devices, input/output devices, etc.
Communication circuit 504 may be configured to transmit and receive information utilizing one or more communication protocols (e.g., TCP/IP) and one or more communication networks including a local area network (LAN), a wide area network (WAN), the Internet, and/or a wireless data network (e.g., a Wi-Fi, 3G, 4G/LTE, or 5G network). Memory 506 may include a storage medium (e.g., a non-transitory storage medium) configured to store machine-readable instructions that, when executed, cause processor 502 to perform one or more of the functions described herein. Examples of the machine-readable medium may include volatile or non-volatile memory including but not limited to semiconductor memory (e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM)), flash memory, and/or the like. Mass storage device 508 may include one or more magnetic disks such as one or more internal hard disks, one or more removable disks, one or more magneto-optical disks, one or more CD-ROM or DVD-ROM disks, etc., on which instructions and/or data may be stored to facilitate the operation of processor 502. Input device 510 may include a keyboard, a mouse, a voice-controlled input device, a touch sensitive input device (e.g., a touch screen), and/or the like for receiving user inputs to apparatus 500.
It should be noted that apparatus 500 may operate as a standalone device or may be connected (e.g., networked, or clustered) with other computation devices to perform the functions described herein. And even though only one instance of each component is shown in
While this disclosure has been described in terms of certain embodiments and generally associated methods, alterations and permutations of the embodiments and methods will be apparent to those skilled in the art. Accordingly, the above description of example embodiments does not constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure. In addition, unless specifically stated otherwise, discussions utilizing terms such as “analyzing,” “determining,” “enabling,” “identifying,” “modifying” or the like, refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computer system's registers and memories into other data represented as physical quantities within the computer system memories or other such information storage, transmission or display devices.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementations will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.