The present invention relates to labeled and unlabeled images. More specifically, the present invention relates to systems and methods for associating unlabeled images with other, labeled images of the same locations.
The field of machine learning is a burgeoning one. Daily, more and more uses for machine learning are being discovered. Unfortunately, to properly use machine learning, data sets suitable for training are required to ensure that systems accurately and properly accomplish their tasks. As an example, for systems that recognize cars within images, training data sets of labeled images containing cars are needed. Similarly, to train systems that, for example, track the number of trucks crossing a border, data sets of labeled images containing trucks are required.
As is known in the field, these labeled images are used so that, by exposing systems to multiple images of the same item in varying contexts, the systems can learn how to recognize that item. However, as is also known in the field, obtaining labeled images which can be used for training machine learning systems is not only difficult, it can also be quite expensive. In many instances, such labeled images are manually labeled, i.e., labels are assigned to each image by a person. Since data sets can sometimes include thousands of images, manually labeling these data sets can be a very time-consuming task.
It should be clear that labeling video frames also runs into the same issues. As an example, a 15-minute video running at 24 frames per second will have 21,600 frames. If each frame is to be labeled so that the video can be used as a training data set, manually labeling the 21,600 frames will take hours if not days.
Moreover, manually labeling those video frames will likely introduce substantial error. Selecting, for instance, ‘the red car’ in 21,600 frames is a tedious task in addition to a time-consuming one. The person doing that labeling is likely to lose focus from time to time, and their labels may not always be accurate.
In addition, much of the labeling process is redundant. Multiple images and/or video sequences may show the same locations. As an example, multiple videos may show the same stretch of road or railroad. Manually labeling features of each individual frame in each sequence would then be an extremely repetitive task.
Many machine vision techniques have been developed to address these issues.
However, these techniques can be easily misled by small changes in an image. For instance, an image of a certain location that was captured in late spring may appear very different, on a detailed level, from an image of the same location that was taken from the same vantage point in the middle of winter. Although high-level structural features of the image (roads, trees, geographic formations, buildings, etc.) may be the same in both images, the granular details of the image may confuse typical techniques.
Additionally, many common techniques perform image similarity detection based on histograms or other pixel-based visual features. Thus, these techniques generally do not provide enough resolution for precise image alignment, as they are sensitive to variations in pixel colour (e.g., sunny vs. cloudy; grass vs. snow covering grass; tree with and without leaves), instead of using higher level abstractions (e.g., there is a large tree near a house that has stone-like walls).
From the above, there is therefore a need for systems and methods that overcome the problems of both manual labeling and typical machine vision techniques. Preferably, such systems and methods would ensure that labels on labeled images are accurately placed on unlabeled images of the same locations.
The present invention provides a system and methods for associating unlabeled images with other, labeled images of the same locations. A high-level signature of each image is generated, representing high-level structural features of each image. A signature of an unlabeled image is then compared to a signature of a labeled image. If those signatures match within a margin of tolerance, the images are interpreted as representing the same location. One or more labels from the labeled image can then be automatically applied to the unlabeled image. In one embodiment, the images are frames from separate video sequences. In this embodiment, entire unlabeled video sequences can be labeled based on a labeled video sequence covering the same geographic area. In some implementations, the high-level signatures are generated by rule-based signature-generation modules. In other implementations, the signature-generation module can be a neural network, such as a convolutional neural network.
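Purely as an illustrative, non-limiting example, the overall flow described above can be sketched in Python as follows. The helper callables make_signature and match are hypothetical stand-ins for the signature-generation step and the tolerance-based comparison; they are named here only for the purposes of this sketch.

    from typing import Any, Callable, Optional, Sequence

    def transfer_labels(labeled_image: Any,
                        labels: Sequence[Any],
                        unlabeled_image: Any,
                        make_signature: Callable[[Any], Any],
                        match: Callable[[Any, Any], bool]) -> Optional[list]:
        """Apply pre-existing labels to an unlabeled image of the same location.

        make_signature stands in for the signature-generation module and
        match for the comparison performed within a margin of tolerance.
        """
        # Generate a high-level signature for each image.
        labeled_signature = make_signature(labeled_image)
        unlabeled_signature = make_signature(unlabeled_image)

        # If the signatures match within tolerance, the images are interpreted
        # as showing the same location and the labels are carried over.
        if match(labeled_signature, unlabeled_signature):
            return list(labels)
        return None  # no match: the unlabeled image may be flagged for review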
In a first aspect, the present invention provides a method for associating an unlabeled image with a labeled image, the method comprising:
In a second aspect, the present invention provides a method for associating an unlabeled frame with a labeled frame, the method comprising:
In a third aspect, the present invention provides a system for associating an unlabeled image with a labeled image, the system comprising:
In a fourth aspect, the present invention provides non-transitory computer-readable media having encoded thereon computer-readable and computer-executable instructions, which, when executed, implement a method for associating an unlabeled image with a labeled image, the method comprising:
In a fifth aspect, the present invention provides non-transitory computer-readable media having encoded thereon computer-readable and computer-executable instructions, which, when executed, implement a method for associating an unlabeled frame with a labeled frame, the method comprising:
The present invention will now be described by reference to the following figures, in which identical reference numerals refer to identical elements and in which:
The present invention provides systems and methods that allow for automatically associating unlabeled images with other, labeled images of the same location.
The labels/tags on the labeled image 20 are pre-existing labels/tags (i.e., the labels/tags are applied before the labeled image 20 is received by the system 10). Labels/tags on the labeled image 20 can include bounding boxes that each define a region of the labeled frame, wherein a specific item is contained within each bounding box. (It should be evident, of course, that a ‘bounding box’ does not have to be rectangular. The term ‘bounding box’ as used herein (and, further, the terms ‘label’ and ‘tag’) indicates an element of any shape, size, or form that delineates, highlights, or separates a feature from the rest of the frame.) In other implementations, the label/tag can include a metadata tag applied to the labeled frame as a whole, indicating the presence or absence of a specific feature within the frame as a whole. Additionally, the label/tag can indicate the location of that specific feature within the frame. In some implementations, labels/tags might function both as binary present/absent indicators and as locators. Additionally, the at least one label/tag associated with the labeled image 20 is available to the rest of the system. Conversely, of course, before being received by the system, the ‘unlabeled’ image 30 does not have a label/tag.
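By way of illustration only, one possible (and purely hypothetical) representation of such labels/tags is sketched below; an actual implementation may use any structure that can delineate a region of a frame or tag the frame as a whole.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Label:
        """Illustrative label/tag record.

        region delineates a feature within the frame (simplified here to an
        axis-aligned rectangle, although a 'bounding box' may be of any shape),
        while region=None indicates a frame-level present/absent metadata tag.
        """
        name: str                                            # e.g. "red car", "road sign"
        region: Optional[Tuple[int, int, int, int]] = None   # (x, y, width, height)

    whole_frame_tag = Label(name="contains_truck")                       # frame-level indicator
    boxed_feature = Label(name="road sign", region=(120, 40, 32, 48))    # locates the feature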
In some implementations, all labels from the labeled image are applied to the unlabeled image. In certain implementations, labels may be applied to transitory features of the labeled image. (Transitory features are features that may not remain in the same location over time (e.g., cars, animals, etc.). Thus, transitory features in the labeled image might not appear in the unlabeled image.) Such implementations may be configured so that only a portion of the labels from the labeled image (those corresponding to structural or non-transitory features) are applied to the unlabeled image. That is, in such cases, as the transitory features will not be found in the unlabeled image, their corresponding labels will not be applied. In other implementations, the labeled image received by the invention is labeled so that the label(s)/tag(s) correspond only to structural features of the image.
It should also be clear that the “location” represented by the images 20 and 30 is not required to be a real-world location. That is, the present invention may be configured to work with images of simulated locations, provided that one image has at least one label. Simulated locations may appear, for instance, in video games or in autonomous vehicle simulators, or in artificial or virtual reality scenarios. Although images of these simulated locations may carry location information in their metadata, that location information may not be fully accessible to third parties.
The signature generated by the signature-generation module 40 may be generated according to various well-known image similarity techniques or such techniques in combination. Image similarity detection may include image segmentation techniques, which are well-known in the art (see for instance, Zaitoun & Aqel, “Survey on Image Segmentation Techniques”, International Conference of Communication, Management and Information Technology, 2015, the entirety of which is hereby incorporated by reference). Many possible signature-generation mechanisms exist, including those based on “edge/boundary detection”, “region comparisons”, and “shape-based clustering”, among others. The signature-generation module 40 of the present invention may thus be configured according to many different techniques.
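As one non-limiting example of such a rule-based mechanism, the following sketch produces a coarse edge/boundary-based signature: the image is divided into a grid of cells and the average gradient magnitude of each cell is recorded, preserving large-scale structure while discarding pixel-level detail. This is offered only as an illustration of one technique among the many noted above.

    import numpy as np

    def edge_signature(gray_image: np.ndarray, grid=(16, 16)) -> np.ndarray:
        """Illustrative rule-based signature: a coarse map of edge density.

        gray_image is a 2-D array of pixel intensities; the returned array
        has one value per grid cell.
        """
        gy, gx = np.gradient(gray_image.astype(float))
        magnitude = np.hypot(gx, gy)                     # edge strength at each pixel
        rows, cols = grid
        h, w = magnitude.shape
        trimmed = magnitude[: h - h % rows, : w - w % cols]
        blocks = trimmed.reshape(rows, trimmed.shape[0] // rows,
                                 cols, trimmed.shape[1] // cols)
        return blocks.mean(axis=(1, 3))                  # one coarse value per cell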
The signature-generation module 40 can thus be a rule-based module for image segmentation. However, the signature-generation module 40 preferably includes a neural network. Neural networks typically comprise many layers, each of which performs certain operations on the data that it receives. A neural network can be configured so that its output is a simplified “embedding” (or “representation”) of the original input data. The degree of simplification depends on the number and type of layers and the operations they perform.
It is possible, however, to use only a portion of the available layers in a neural network. For instance, an image can be passed to a neural network having 20 internal layers, with the embedding retrieved after the image has been processed by only 18 or 19 of those layers, or from a combination of the outputs of those layers. An embedding retrieved from such a late, high-level layer would therefore contain fairly high-level information. Such an embedding may be thought of as a “signature” of the initial image. This signature is detailed enough to capture high-level structural features of the image (again, these “structural features” may include such large-scale features as roads, trees, road signals, geographic formations, buildings, and so on). The signatures resulting from this signature-generation module 40 are, however, sufficiently high-level as to not include information on detailed or granular features of the image (such as details of the road condition or the colour of the sky).
In implementations using a neural network, the neural network is preferably a pre-trained convolutional neural network (CNN). CNNs are known to have advantages over other neural networks when used for image processing.
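As a purely illustrative sketch, and assuming the PyTorch and torchvision libraries are available, such a signature could be obtained from a pre-trained CNN by discarding its final, task-specific classification layer and using the output of the remaining high-level layers as the embedding:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Pre-trained CNN with its final classification layer removed, so that the
    # output of a late, high-level layer serves as the image's signature.
    cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone = torch.nn.Sequential(*list(cnn.children())[:-1])
    backbone.eval()

    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def cnn_signature(pil_image):
        """Return a high-level embedding (a numerical tensor) for one image."""
        with torch.no_grad():
            batch = preprocess(pil_image).unsqueeze(0)   # add a batch dimension
            return backbone(batch).flatten()             # e.g. a 512-element signature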
The signatures produced by the signature-generation module 40 are typically numerical tensors. Non-numerical signatures are possible, however, depending on the configuration of the signature-generation module 40.
The signature-generation module 40 in
The execution module 50 compares the second signature to the first signature. The execution module 50 may use such well-known operations as cosine distance, covariance, and correlation as the basis for rule-based comparisons. However, in some implementations, the execution module 50 is another neural network. This other neural network is trained to compare signatures and determine points of importance within them. Data on those points of importance can then inform later comparisons.
It should be noted that in many implementations of the present invention, an exact match between image signatures is unlikely. That is, depending on the exact vantage points at which two images of the same location are captured, the precise locations of structural features within an image are likely to shift slightly from image to image. Exact signature matches may thus not be possible. To account for this, the execution module 50 is preferably configured to determine matches within a margin of tolerance. That margin of tolerance should be set to a level that accommodates such slight differences between images, while still ensuring that the overall structural features of the images match. Thus, the margin of tolerance used may vary depending on the images in question.
Additionally, the margin of tolerance should be set to a level that accounts for transitory features, as described above. For instance, images of a busy intersection can be assumed to be “noisy”; that is, to contain many transitory features that will not be present across time. Thus, a margin of tolerance associated with those images should be high enough to account for that noise level. Conversely, images in a rural location may be assumed to contain a lower level of noise.
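For purposes of illustration only, a rule-based comparison using cosine distance and such a margin of tolerance might be sketched as follows; the tolerance value itself would be chosen according to the considerations above and is not prescribed here.

    import numpy as np

    def signatures_match(sig_a: np.ndarray, sig_b: np.ndarray, tolerance: float) -> bool:
        """Illustrative comparison: cosine distance within a margin of tolerance.

        A higher tolerance accommodates 'noisier' scenes (e.g. a busy
        intersection with many transitory features); a lower tolerance suits
        quieter scenes such as rural locations.
        """
        a = sig_a.ravel().astype(float)
        b = sig_b.ravel().astype(float)
        cosine_distance = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return cosine_distance <= tolerance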
Because each signature is distinct and detailed, when a match is found between a second signature and a labeled-image signature, it can be concluded that the unlabeled image and the labeled image from which those signatures were generated are images of the same location. Thus, any structural features in the labeled image would also appear in the unlabeled image. The execution module 50 can then apply any labels from the labeled image to the matching unlabeled image.
In some implementations, when the execution module 50 determines that the first signature and the second signature do not match within the margin of tolerance described above, the unlabeled image 30 can then be sent to a human for review. After review, the unlabeled image 30 may be fed back to the system 10 to be used in a training set to further train the system 10.
As should also be evident, in some embodiments of the invention, the execution module 50 can be a single module. In other embodiments, however, the execution module 50 can comprise separate modules for each task (i.e., a comparison module for comparing signatures, and a labeling module for applying labels to an unlabeled image when a matching labeled image is found).
Referring now to
As can be imagined,
For the purposes of this example, one can assume that the image in
In another embodiment of the invention, the labeled image 20 and the unlabeled image 30 can be video frames from separate video sequences. That is, the labeled image 20 can be a labeled frame from a sequence of labeled frames, and the unlabeled image 30 can be an unlabeled frame from a sequence of unlabeled frames.
Note that the labeled frames 300A-300D are coloured light grey in
For ease of comparison, it is preferable that the two sequences 300 and 310 have the same frame rate and represent the same geographic distance. However, the two sequences may or may not be of the same duration. That is, as may be understood, it is not necessary for the labeled and unlabeled frames to have a one-to-one relationship. In some cases (for example, where one of the videos was captured from a train that stopped to wait along the route), multiple frames from the delayed sequence might all correspond to the same frame in the non-delayed sequence. The geographic location represented by those multiple frames would thus be the location at which the train stopped to wait.
The generated signatures are passed to the execution module 50. The execution module 50 then compares the signature of the first unlabeled frame to a specific signature of a specific labeled frame, using a margin of tolerance as described above. Depending on the implementation, it may be preferable for that specific labeled frame to be the first labeled frame in the labeled sequence 300 (i.e., labeled frame 300A).
If the signature of the first unlabeled frame matches the signature of the specific labeled frame, the execution module 50 applies one or more labels from the specific labeled frame to the first unlabeled frame. On the other hand, if the signature of the first unlabeled frame does not match the signature of the specific labeled frame, the execution module 50 selects a signature of another labeled frame. The execution module 50 compares the signature of the first unlabeled frame to that newly selected signature. Again, if these signatures match, one or more labels from the labeled frame corresponding to that newly selected signature may be applied to the first unlabeled frame. If these signatures do not match, a signature of another new labeled frame may be selected for comparison. Where, conveniently, the two video sequences begin at approximately the same geographic location, a matching signature should be found for the unlabeled frame after only a few comparisons. It should be clear to a person skilled in the art that the number of possible comparisons for a signature of any individual unlabeled frame should be limited to a predetermined number; if the number is not limited, there is a possibility of an infinite loop occurring. If that predetermined number of possible comparisons is reached for a signature of any given unlabeled frame, it can be concluded that the unlabeled frame does not match any labeled frame in the labeled sequence. In some implementations, the unlabeled frame can then be sent to a human for review and potentially fed back to the system 10 to be used in a training set to further train the system 10.
Once a signature for an unlabeled frame has been found to match a signature for a labeled frame, and one or more labels from the labeled frame have been applied to the matching unlabeled frame, the signature-generation module 40 can generate a new signature based on a new unlabeled frame chosen from the unlabeled sequence 310. For efficiency, it is generally preferable that this new unlabeled frame be adjacent to the first unlabeled frame in the sequence, meaning that there are no other frames between the first unlabeled frame and this new unlabeled frame in the unlabeled sequence 310. As an example, if the first unlabeled frame was 310A, it is generally preferable that the new unlabeled frame be 310B. However, in other implementations, the new unlabeled frame may not be adjacent to the first unlabeled frame. Such other implementations may be preferable depending on the user's needs.
The signature of the new unlabeled frame is then passed to the execution module 50 and compared to signatures of the labeled frames, as described above for the signature of the first unlabeled frame. Once a labeled frame is found that matches that new unlabeled frame, one or more labels may be applied to the unlabeled frame. A signature is generated for a third unlabeled frame from the unlabeled sequence 310 (e.g., for unlabeled frame 310C), and the comparison/labeling process repeated once again. This overall signature-generation/comparison/labeling process repeats until all unlabeled frames in the unlabeled sequence 310 have been processed.
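Purely by way of illustration, the matching of a single unlabeled frame, including the limit on the number of comparisons, might be sketched as follows. Here make_signature and match are hypothetical stand-ins for the signature-generation module 40 and the tolerance-based comparison of the execution module 50, and the selection of candidate labeled frames is simplified to an in-order scan.

    from typing import Any, Callable, List, Optional, Sequence, Tuple

    def find_matching_labels(unlabeled_signature: Any,
                             labeled_frames: Sequence[Tuple[Any, List[Any]]],
                             make_signature: Callable[[Any], Any],
                             match: Callable[[Any, Any], bool],
                             comparison_limit: int) -> Optional[List[Any]]:
        """Find labels for one unlabeled frame (illustrative sketch only).

        labeled_frames is an ordered sequence of (frame, labels) pairs. The
        search is capped at comparison_limit comparisons so that no unbounded
        loop can occur.
        """
        for labeled_frame, labels in list(labeled_frames)[:comparison_limit]:
            if match(make_signature(labeled_frame), unlabeled_signature):
                return labels   # match found: these labels are applied to the frame
        return None             # no match within the limit: flag the frame for review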
Once both a signature of the labeled image and a signature of the unlabeled image have been generated at steps 510A and 510B, respectively, the signatures are compared at step 520. If the signatures match (within the margin of tolerance as described above), one or more labels from the labeled image are applied to the unlabeled image (step 530). If the signatures do not match, however, the unlabeled image can be sent to a human for review at step 540.
As discussed above, signatures of labeled frames and unlabeled frames can be generated individually on an as-needed-for-comparison basis. Alternatively, the signatures of labeled frames and the signatures of unlabeled frames can be generated in separate batches and stored for later comparison. As a further alternative, signatures of labeled frames can be generated in a batch while signatures of unlabeled frames are generated on an as-needed basis, as shown in
Again,
Meanwhile, at step 630, an unlabeled frame is selected from the unlabeled sequence. Again, it is generally preferable for that unlabeled frame to be the first frame from the unlabeled sequence. A signature is then generated based on that selected unlabeled frame, at step 640.
At step 650, the signature for the selected unlabeled frame from step 640 is compared to the selected signature for a labeled frame from step 620. If the signatures match (within a predetermined margin of tolerance, as described above), one or more labels from the labeled frame are applied to the unlabeled frame (step 660). Then, at step 680, the original unlabeled sequence is examined. If any frames from the unlabeled sequence have not yet been processed (that is, a corresponding signature has not yet been generated and compared to signatures of labeled frames), the method returns to step 630 and a new unlabeled frame from the unlabeled sequence is selected. (Clearly, this new unlabeled frame is one that has not yet been processed. As mentioned above, for efficiency, the new unlabeled frame is preferably adjacent to the original unlabeled frame in the unlabeled sequence.)
Returning to step 650, the signature of the unlabeled frame might not match the selected labeled-frame signature within the predetermined margin of tolerance. In this case, step 670 determines whether a predetermined comparison limit (i.e., a predetermined maximum number of possible comparisons) has been reached. If this comparison limit has not been reached, the method returns to step 620 and a new labeled-frame signature is selected for comparison to the signature of the unlabeled frame.
If the comparison limit has been reached, however, it can be concluded that the unlabeled frame does not have a matching frame in the labeled sequence. At this point, the unlabeled frame can be sent to a human for review, and potentially fed back to the system. The method then returns to step 680. At step 680, then, the unlabeled sequence would be examined as described above.
When the examination at step 680 determines that all unlabeled frames from the original unlabeled sequence have been processed, the entire unlabeled sequence is labeled (or alternatively, flagged for review) and the method is complete.
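As a further non-limiting illustration, the batch variant of the method (corresponding generally to steps 610 through 680) might be sketched as follows, with labeled-frame signatures generated up front and unlabeled-frame signatures generated only as needed; as before, make_signature and match are hypothetical placeholders.

    def label_sequence_batch(labeled_frames, unlabeled_frames,
                             make_signature, match, comparison_limit):
        """Illustrative sketch of the batch method (steps 610-680).

        labeled_frames is a list of (frame, labels) pairs; unlabeled_frames is
        a list of frames in sequence order.
        """
        # Step 610: generate and store a signature for every labeled frame.
        labeled_signatures = [(make_signature(f), labels) for f, labels in labeled_frames]

        labeled_output, review_queue = [], []
        for frame in unlabeled_frames:                   # steps 630/680: next unprocessed frame
            signature = make_signature(frame)            # step 640
            matched = False
            for labeled_sig, labels in labeled_signatures[:comparison_limit]:  # step 670 caps the search
                if match(labeled_sig, signature):        # step 650
                    labeled_output.append((frame, labels))   # step 660
                    matched = True
                    break
            if not matched:
                review_queue.append(frame)               # flag for human review
        return labeled_output, review_queue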
It should be clear that the various aspects of the present invention may be implemented as software modules in an overall software system. As such, the present invention may thus take the form of computer-executable instructions that, when executed, implement various software modules with predefined functions.
The embodiments of the invention may be executed by a computer processor or similar device programmed in the manner of method steps, or may be executed by an electronic system which is provided with means for executing these steps. Similarly, electronic memory means such as computer diskettes, CD-ROMs, Random Access Memory (RAM), Read Only Memory (ROM) or similar computer software storage media known in the art may be programmed to execute such method steps. As well, electronic signals representing these method steps may also be transmitted via a communication network.
Embodiments of the invention may be implemented in any conventional computer programming language. For example, preferred embodiments may be implemented in a procedural programming language (e.g., “C” or “Go”) or an object-oriented language (e.g., “C++”, “java”, “PHP”, “PYTHON” or “C#”). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
Embodiments can be implemented as a computer program product for use with a computer system. Such implementations may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or electrical communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server over a network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention may be implemented as entirely hardware, or entirely software (e.g., a computer program product).
A person understanding this invention may now conceive of alternative structures and embodiments or variations of the above all of which are intended to fall within the scope of the invention as defined in the claims that follow.
This application is a non-provisional patent application which claims the benefit of U.S. Provisional Application No. 62/695,897 filed on Jul. 10, 2018.