This patent document describes a system and method for finding the best object match in an object catalog given a set of one or more input images of an unknown object. The system and method are particularly useful in transportation maintenance and related infrastructure control operations, in which operators must maintain and understand the condition of a wide variety of road signs that are dispersed around an area. The system and method also may have other applications as will be described below.
The creation and maintenance of accurate datasets is required for the operation of many modern systems. Autonomous vehicles and navigation systems rely on an accurate dataset of map data that includes not only locations of streets and roads, but also locations and content of traffic control signs such as street signs, speed limit signs and directional signs. In addition, state and municipal road maintenance agencies, and contractors who work for them, rely on such datasets to track and maintain roads, signs, traffic signals and other components of transportation infrastructure in order to enable safer, more efficient driving conditions in the geographic areas for which they are responsible. Further, automated manufacturing and warehouse systems rely on accurate identification of objects in inventory, as well as of signage that identifies parts stored in a particular bin location.
Systems that use computer vision to identify and classify objects are commonly used to create datasets such as those described above. However, current systems still have their limitations. For example, some computer vision systems can classify objects but lack the ability to provide other information that is critical to modern systems, such as information about the condition of the object. In addition, many computer vision systems have difficulty distinguishing similar objects from each other (such as street signs that have similar shapes, but different content printed on them, utility poles from various utilities, or different species of trees). Others may categorize dissimilar-looking objects as different objects, even though they fall into the same class (such as an older style of street sign vs. a newer style of street sign).
This document describes methods and systems that address the issues described above, and/or other issues.
This patent document presents a novel system and method for matching an object (such as a street sign) observed in one or more images to a specific object in a fixed catalog of known objects (such as a database of street signs and their locations in a geographic area). The system leverages relative scoring of catalog objects and a weighted majority algorithm to combine multiple observations to generate a final match hypothesis and associated confidence metric. The confidence metric is then used to either accept the hypothesis, or refine it via a human-in-the-loop interface. Human refinements may then be used as feedback to expand the catalog, increasing system performance accordingly.
In various system, method and computer program embodiments described in more detail below, a system includes or has access to a data store comprising a catalog of images of known objects such as road signs and other transportation infrastructure objects. In the catalog, the images include a plurality of views for at least some of the known objects. The catalog also includes, for each known object, a feature vector that represents features from one or more views of the known object. When the system receives input images that include various views of an unknown object (such as a transportation infrastructure object), the system will process the input images to generate a feature vector for the unknown object. The system will compare the feature vector for the unknown object to the stored feature vectors in the catalog to generate one or more candidate labels for the unknown object. The system will cause a user interface to output an image of the unknown object and at least one of the candidate labels. The system will then select, from the candidate labels, a final label for the unknown object. Optionally, the system may add the input images and the final label to the catalog as a new known object.
In some embodiments, before comparing the feature vector for the unknown object to the stored feature vectors in the catalog to generate one or more candidate labels for the unknown object, the system will analyze the input images to identify a primary color of the unknown object. The system will then filter the catalog to yield a known object catalog subset that excludes images of known objects that do not correspond to the primary color of the unknown object. Then, when comparing the feature vector for the unknown object to the stored feature vectors in the catalog, the system will use the known object catalog subset rather than all known objects in the larger catalog.
In some embodiments, to select the final label for the unknown object the system may accept a candidate object label that a user has accepted via a user interface. Alternatively, if the user selected a different object label via the user interface, the system may select that different object label as the final object label.
In some embodiments, when comparing the feature vector for the unknown object to the stored feature vectors in the catalog, the system may generate a confidence metric for each of the candidate object labels. If so, then when selecting the final label for the unknown object the system may select a known object that is associated with a stored feature vector for which the confidence metric exceeds a threshold.
In some embodiments, when processing the input images to generate the feature vector for the unknown object, the system may generate a matrix of feature values for each of the views of the unknown object. The matrix may have a size of M×F or F×M, in which M is the total number of views of the input object and F is the total number of feature values, so that each row or column of the matrix is a vector of all feature values for one view of the input object. Similarly, for each labeled object in the catalog, the stored feature values may be arranged in a matrix in which each row or column is a vector of all feature values for one of that object's images in the catalog, and the number of rows or columns equals the total number of images of the known object.
In some embodiments, when comparing the feature vector for the unknown object to the stored feature vectors in the catalog to generate one or more candidate object labels, the system may generate a distance matrix comprising distances between values in the matrix of feature values and values in the feature vectors for all known objects in the catalog.
In some embodiments, when comparing the feature vector for the unknown object to the stored feature vectors in the catalog to generate one or more candidate object labels, the system may generate a closeness matrix in which each row or column is a vector of closeness values of each view of the input object to a known object in the catalog, and the closeness matrix has a number of rows or columns equal to the number of labeled objects in the catalog.
In some embodiments, to generate the one or more candidate object labels for the unknown object, the system may generate an initial match hypothesis for the unknown object. The initial match hypothesis is a potential match with one of the known objects in the catalog. Then, when the user interface outputs at least one of the candidate object labels, the system will display at least an image of a known object that corresponds to the initial match hypothesis. The system also may display images of a plurality of the known objects that are in a nearest neighbor network of the known object that corresponds to the initial match hypothesis.
The methods and systems described in this document provide a novel fusion of computer vision, machine learning and human-computer interaction methods to perform a complicated task of matching an unknown object observed in one or more images to a known object in a fixed catalog.
The methods and systems described below can be used to enhance various computer vision-related search applications. For example, the teachings of this disclosure can be particularly useful in applications where redundant observations of an object can be obtained, and the catalog to be searched is known in advance and generally stable. Example applications include: (a) classification of roadway objects such as traffic signs, guard rails, and street lights; (b) wildlife species identification; (c) identification of parts or objects in manufacturing or warehouse inventory; and (d) assessing a condition of any or all of the items described above.
Transportation infrastructure objects are objects that are installed on or near roadways, parking areas and other transportation networks to guide vehicles and their operators along a route. The transportation networks may include networks of roads on which vehicles travel; rail networks for trains, subways and the like; urban transportation networks which include roads and/or sidewalks, bike paths, and other transportation paths; and even networks of defined paths along which robotic devices and/or humans may travel, such as defined paths in a warehouse or other commercial or industrial facility, or a park or other recreational area. Some transportation infrastructure objects may be directional, such as street signs and upcoming exit signs. Other transportation infrastructure objects may notify vehicles and their operators of traffic control measures, as with speed limit signs, stop signs, yield signs, and signs indicating that construction or other conditions lie ahead along the route.

The classification of transportation infrastructure objects, and in particular the problems associated with automated, detailed classification of traffic signs, serves as a good example of the applicability of the methods and systems described in this document. In the United States, traffic signs are classified using a catalog (the Manual on Uniform Traffic Control Devices, or MUTCD) that contains over 1,000 individual records, many of which contain only subtle visual differences. It is critical for safety and maintenance that each physical sign is associated with the correct catalog item. However, most modern approaches to image search (for example, a visual deep learning system based on one or more Convolutional Neural Networks) are not well-suited to this problem due to visual similarities in the catalog and significant variance in real-world input observations. Further, due to the size of the catalog, training such a network with sufficient accuracy would require an enormous set of priors (labels) and equivalent manual labor to prepare. As detailed below, this class of problem requires a more integrated approach that leverages visual similarity, confidence/scoring, and human reinforcement when necessary.
As shown in FIG. 1, the catalog 101 of known objects is available to the system. The system will then receive one or more images of an unknown object 105, which may be referred to in this document as an “input object.” The images of the unknown object 105 may represent multiple views of the unknown object. The goal of the system is to match the unknown object 105 to the best-matching known object in the finite-sized catalog 101.
Optionally, before performing the match, the system may filter the catalog 101 to reduce the number of images that will be compared in a search function. In such a situation, the complete catalog 101 may be considered to be a superset catalog, and the result of the filtering will be a subset catalog 108 that contains a subset 109 of the known objects that are in the superset catalog. The filtering function may be a color filter 104 that yields a subset catalog 108 with a known objects subset 109 that includes only known objects having a primary color that corresponds to (i.e., matches or is similar to) the dominant color of the unknown input object 105. This can be especially useful in applications such as transportation infrastructure, as traffic control signs typically have a dominant color (for example, stop signs are predominantly red, many speed limit signs are predominantly white, and signs that provide cautionary information about potential road hazards ahead are predominantly yellow). The resulting subset catalog 108 will have fewer data points than the superset catalog 101 and therefore may be searched more quickly in the search function 107, which will be described in the following paragraph.
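By way of illustration, a minimal sketch of such a color filter is shown below in Python. The dominant color of the input image is estimated from per-pixel hue, and each catalog entry is assumed to carry a “primary_color” tag; the hue ranges, thresholds, function names and catalog structure are illustrative assumptions rather than part of the system described above.

import numpy as np

# Illustrative hue ranges (degrees, 0-360) for common sign colors; a real system would tune these.
HUE_RANGES = {
    "red":    [(0.0, 20.0), (340.0, 360.0)],
    "yellow": [(40.0, 70.0)],
    "green":  [(80.0, 160.0)],
    "blue":   [(190.0, 260.0)],
}

def dominant_color(image_rgb: np.ndarray) -> str:
    """Estimate the primary color of an H x W x 3 uint8 RGB image."""
    rgb = image_rgb.astype(float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    chroma = mx - mn

    # Hue (degrees) per pixel, computed only where the pixel carries color.
    hue = np.zeros_like(mx)
    mask = chroma > 1e-6
    idx = mask & (mx == r)
    hue[idx] = (60.0 * (g - b)[idx] / chroma[idx]) % 360.0
    idx = mask & (mx == g)
    hue[idx] = 60.0 * (b - r)[idx] / chroma[idx] + 120.0
    idx = mask & (mx == b)
    hue[idx] = 60.0 * (r - g)[idx] / chroma[idx] + 240.0

    colored = mask & (chroma > 0.15) & (mx > 0.2)
    if colored.sum() < 0.05 * hue.size:
        return "white"        # mostly unsaturated pixels, treat as white/grey
    votes = {name: sum(((hue >= lo) & (hue < hi) & colored).sum() for lo, hi in ranges)
             for name, ranges in HUE_RANGES.items()}
    return max(votes, key=votes.get)

def filter_catalog(catalog: list[dict], input_image: np.ndarray) -> list[dict]:
    """Keep only catalog entries whose tagged primary color matches the input's dominant color."""
    color = dominant_color(input_image)
    return [entry for entry in catalog if entry["primary_color"] == color]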
A system comprising a processor and a memory containing programming instructions applies a search function 107 to the input object (unknown object 105) and the catalog of known objects. At this step, the system may apply the search function to the subset catalog 108 as shown in FIG. 1 or, if no filtering was performed, to the complete superset catalog 101.
As shown in FIG. 3, the set of views of the unknown object 105 may be denoted as:

O = {o(1), o(2), . . . , o(M)}   (1)

where M is the total number of views of the unknown object.
As shown in FIG. 3, the superset catalog 101 of known objects may be denoted as:

K = {K1, K2, . . . , Kz}   (2)

where z is the total number of known objects in the superset catalog.
Each known object's catalog item Ki 303 in the superset catalog 101 will include or be associated with one or more images of the known object, denoted as:

Ki = {ki(1), ki(2), . . . , ki(n)}   (3)

where ki(j) is the jth image belonging to the ith known object, Ki, and n is the total number of images belonging to that known object.
As noted above in the discussion of FIG. 1, the system may filter the superset catalog 101 to yield a subset catalog 108. The subset catalog of known objects may be denoted as:

K = {K1, K2, . . . , KL}   (4)

where L is the total number of known objects in the subset catalog.
Images from an example subset catalog are shown in the accompanying figures.
X is a catalog feature matrix 403 containing all the feature values of all observations of the known objects in the catalog. X is a T×F matrix whose ith row, [x1(i), . . . , xF(i)], is a vector of all feature values of the ith observation in the collection of known object observations. Here F is the total number of feature values, and T = Σi=1L ni is the total number of known object observations (images) in the catalog, where ni is the number of images of the ith known object.
Returning to FIG. 4, the system also generates an object feature matrix 415 (Y) containing the feature values of all views of the unknown object 105. Y is an M×F matrix whose jth row, [y1(j), . . . , yF(j)], is a vector of all feature values of the jth observation of the detected object, o(j). (Note: the rows and columns of the matrix may be reversed, so that the data is arranged as F×M.)
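By way of illustration, the following Python sketch shows one way the catalog feature matrix X (T×F) and the object feature matrix Y (M×F) might be assembled. The extract_features function stands in for whatever feature extraction the system actually uses; the toy histogram inside it, and the owners bookkeeping list (recording which known object each catalog view belongs to), are assumptions made only so that the sketch is self-contained.

import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Placeholder feature extractor returning an F-dimensional vector.
    A real system might use an embedding from a trained vision model;
    a grayscale histogram is used here only so the sketch runs end to end."""
    hist, _ = np.histogram(image, bins=64, range=(0, 255), density=True)
    return hist.astype(np.float32)

def build_catalog_matrix(catalog_images: list[list[np.ndarray]]) -> tuple[np.ndarray, list[int]]:
    """Stack feature vectors of every view of every known object into X (T x F).
    Also return, for each of the T rows, the index of the known object it came from."""
    rows, owners = [], []
    for obj_index, views in enumerate(catalog_images):
        for view in views:
            rows.append(extract_features(view))
            owners.append(obj_index)
    return np.vstack(rows), owners

def build_object_matrix(input_views: list[np.ndarray]) -> np.ndarray:
    """Stack feature vectors of the M views of the unknown object into Y (M x F)."""
    return np.vstack([extract_features(view) for view in input_views])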
Returning to FIG. 4, the system then generates a distance matrix 423 (D), in which each row, [d1(j), . . . , dT(j)], is a computed distance vector 422 of all feature values of the jth view of the unknown object 105, o(j), to all T views of the known objects in the catalog 101. The system may determine each distance vector as the Euclidean distance between the feature vectors from the catalog feature matrix 411 and the object feature matrix 415.
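A minimal sketch of this Euclidean distance computation follows; the array layout (views of the unknown object as rows of Y, catalog views as rows of X) follows the matrices in the preceding sketch, and the function name is illustrative.

import numpy as np

def distance_matrix(Y: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Euclidean distances between the M object views (rows of Y, M x F)
    and the T catalog views (rows of X, T x F); the result D is M x T."""
    # |y - x|^2 = |y|^2 + |x|^2 - 2 y.x, computed without explicit loops
    sq = (Y ** 2).sum(axis=1)[:, None] + (X ** 2).sum(axis=1)[None, :] - 2.0 * Y @ X.T
    return np.sqrt(np.maximum(sq, 0.0))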
Each row of the distance matrix 423 corresponds to the jth view of the unknown object 105, o(j). To generate an initial match hypothesis, at 431 the system will first sort the distances for each jth view of the unknown object 105, o(j), from closest to farthest, across all T images in the catalog. At 431 the system will also assign a rank to each catalog image based on the sorted distances, in which the highest rank (rank 1) corresponds to the smallest distance. This is shown by way of example in the accompanying figures.
The rank vector for each known object Ki in the catalog corresponding to the jth view of the unknown object, o(j), is defined as:
RKij = [ri(1), ri(2), . . . , ri(n)]   (8)

where n is the total number of images belonging to the known object Ki and each ri(*) ∈ {1, 2, . . . , T}.
Each known object Ki in the catalog is assigned a rank of:
sij = min(RKij)   (9)

where RKij is obtained as in equation (8).
The final sorted matrix 432 (S) for the L known objects in the catalog has one row per view of the unknown object: the jth row, [s1(j), . . . , sL(j)], is a vector of the sorted rank values of the jth view of the unknown object 105, o(j), with respect to all L known objects in the catalog 101.
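The construction of S can be illustrated with the following sketch, which operates on the distance matrix D from the earlier sketch and on the owners bookkeeping array recording which known object each of the T catalog views belongs to; the names are illustrative assumptions.

import numpy as np

def sorted_rank_matrix(D: np.ndarray, owners: np.ndarray, num_objects: int) -> np.ndarray:
    """Build S from the distance matrix D (M x T): for each view of the unknown
    object, rank all T catalog views by distance (rank 1 = closest, step 431),
    group the ranks by known object (equation (8)), and keep each object's best
    rank (equation (9)). The result has one row per view and one column per object."""
    M, T = D.shape
    S = np.empty((M, num_objects), dtype=int)
    for j in range(M):
        order = np.argsort(D[j])                 # catalog-view indices, closest first
        ranks = np.empty(T, dtype=int)
        ranks[order] = np.arange(1, T + 1)       # invert the permutation into ranks
        S[j] = [ranks[owners == i].min() for i in range(num_objects)]
    return S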
Finally, at 433 the system will assign closeness values for each view of the unknown object 405 to the L known objects in the catalog 401. The closeness values may be a sorted, normalized rank value for each view with respect to each known object in the catalog. The system will thus output a closeness matrix 434 (C), in which each row, [c1(j), . . . , cL(j)], is a vector of closeness values of the jth view of the unknown object, o(j), to the L known objects in the catalog, and each closeness value lies in the range c(j) ∈ [0, 1].
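A small sketch of step 433 follows. The normalization used here (best possible rank maps to 1.0, worst to 0.0) is only one plausible choice, since the exact normalization appears in the original figures rather than in the text above.

import numpy as np

def closeness_matrix(S: np.ndarray, T: int) -> np.ndarray:
    """Map the sorted rank matrix S (one row per view, one column per known object,
    values in 1..T) to the closeness matrix C with values in [0, 1] (step 433)."""
    if T <= 1:
        return np.ones_like(S, dtype=float)
    return (T - S.astype(float)) / (T - 1)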
The system will then use the closeness matrix 434 (C) to generate a match hypothesis H for each view of the unknown object: the match hypothesis for the jth view of the unknown object 405, o(j), is the known object Ki in the catalog 401 if ci(j) ≥ T, where T ∈ [0, 1] is a threshold value.
Finally, referring again to FIG. 1, the system will combine the per-view hypotheses using a weighted majority function to generate a final match hypothesis for the unknown object, where h(j) is the hypothesis corresponding to the jth view of the unknown object 105, o(j), ci(m) is the closeness value between the known object Ki in the catalog 101 and the mth view of the unknown object 105, o(m), and T ∈ [0, 1] is the threshold value.
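The following sketch shows one plausible realization of the per-view hypothesis and weighted-majority steps described above. The exact combination formula appears in the original figures, so the vote weighting used here (each view votes for its closest known object, weighted by its closeness value) is an assumption.

import numpy as np
from typing import Optional

def final_hypothesis(C: np.ndarray, threshold: float) -> Optional[int]:
    """One plausible weighted-majority fusion of the per-view hypotheses.

    C is the closeness matrix (one row per view of the unknown object, one column
    per known object). Each view votes for its closest known object if that
    closeness clears the threshold T, and the vote is weighted by the closeness
    value; the known object with the largest total weight wins."""
    votes = np.zeros(C.shape[1])
    for row in C:                        # one row per view of the unknown object
        best = int(np.argmax(row))       # per-view match hypothesis h(j)
        if row[best] >= threshold:
            votes[best] += row[best]     # weight the vote by its closeness value
    if votes.max() <= 0.0:
        return None                      # no view produced a hypothesis above the threshold
    return int(np.argmax(votes))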
Referring back to FIG. 1, the system will also generate a confidence metric for the final match hypothesis as a function of a quality measure Q(o(m)) of each view, where Q(.) is a quality function. The quality function may include calculating the pixel area, brightness, contrast or no-reference image quality measures such as those described in Mittal et al., “No-Reference Image Quality Assessment in the Spatial Domain”, IEEE Transactions on Image Processing (2012). In the confidence metric, o(m) is the mth view of the unknown object 105, and M is the total number of views that correspond to the unknown object 105.
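The following sketch shows one plausible quality-weighted confidence metric consistent with the description above; the toy quality function and the exact weighting are assumptions, since the precise formula appears in the original figures.

import numpy as np
from typing import Optional

def image_quality(view: np.ndarray) -> float:
    """Toy quality function Q(.): pixel area scaled by contrast. A real system
    might instead use brightness or a no-reference measure such as the one cited above."""
    area = view.shape[0] * view.shape[1]
    contrast = float(view.std()) / 255.0
    return area * max(contrast, 1e-3)

def confidence_metric(views: list[np.ndarray], per_view_hypotheses: list[Optional[int]], final_label: int) -> float:
    """One plausible quality-weighted confidence: the fraction of the total image
    quality contributed by the views whose hypotheses agree with the final match."""
    weights = np.array([image_quality(v) for v in views], dtype=float)
    agree = np.array([h == final_label for h in per_view_hypotheses], dtype=float)
    return float((weights * agree).sum() / weights.sum())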
If the confidence metric exceeds a threshold (115: YES), the system will consider the initial match hypothesis to be the final match 117. If the confidence metric does not exceed the threshold (115: NO), the system will send the initial match hypothesis and object images to a reinforcement system 119.
A pre-computed neighbor network 501: Prior to receiving a new image of an object for analysis, the system will develop or receive a nearest neighbor network 501. In particular, for each known object in the catalog, a neighbor network containing its G closest match hypotheses (neighbors) within the catalog is computed through the search process described above in the context of FIG. 4.
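The neighbor network computation can be illustrated with the following sketch, which lists, for each known object, the indices of its G closest neighbors in the catalog. Reducing each object to a single feature vector (for example, the mean of its view features) is an assumption made only for brevity; the system described above computes the neighbors through the full search process.

import numpy as np

def neighbor_network(object_features: np.ndarray, G: int) -> dict[int, list[int]]:
    """For each known object (one feature vector per object here), list the
    indices of its G closest neighbors in the catalog."""
    diffs = object_features[:, None, :] - object_features[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)                   # an object is not its own neighbor
    return {i: [int(j) for j in np.argsort(dist[i])[:G]]
            for i in range(len(object_features))}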
A graphic user interface (UI) 503, which may be generated by a processor executing programming instructions referred to in this document as the “montage tool” 502, for displaying (a) an observation of the input object, (b) a view of the match hypothesis and a view of its G closest neighbors based on the neighbor network that was generated as described above. The montage tool 502 generates a user interface 503 showing one or more of the system's hypotheses—i.e., the most likely potential match or matches, with the number of possibilities presented selected based on any suitable criteria such as (i) the n matches having the highest confidence values, (ii) all matches having a confidence value exceeding a threshold, (iii) some combination of these, or using other criteria. In the first iteration, the match hypothesis is the result of the search algorithm, but in subsequent iterations the match hypothesis is manually selected from the G closest neighbors. The user interface may display an image of a known object that corresponds to the initial match hypothesis, along with images of known objects that are in the nearest neighbor network of the known object that corresponds to the initial match hypothesis.
This tool allows the user either to select one of the matches or nearest neighbors (505: YES) as the final match 117, thus completing the object identification process, or to identify a match as the best approximate match (505: NO), which triggers another iteration of the reinforcement system 119 based on that approximate match.
As described above, an image-based object identification system may include:
a finite-sized catalog comprising a memory storing one or more images for each known object;
one or more images of an observed object;
a memory containing programming instructions that are configured to cause a processor to execute:
(1) a feature extraction algorithm which generates feature vectors for all observed images and all catalog images;
(2) a relative scoring algorithm which generates one or more object hypotheses for an input image based on the closest matches between the image's feature vector and the feature vectors of the catalog images; and
(3) a weighted majority algorithm which combines the object hypotheses from one or more images of the same input object to find a final object hypothesis for the input object.
Applications of the process above may include a system having a finite set of known objects, so that the system can receive a new object and quickly attempt to match it to one of the known objects in the catalog.
The programming instructions will also cause the processor to generate a confidence metric for accepting the final hypothesis or requesting human intervention.
The processor will then cause a display device of the system to output a component of a visual UI for displaying the input object, the final object hypothesis and the closest catalog objects of the final object hypothesis.
The UI will also include user input mechanisms for a human operator to:
(a) choose a new object hypothesis from the objects shown in the UI;
(b) request a display update using the new object hypothesis; and
(c) finalize the object assignment based on one of the currently displayed objects.
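The components listed above can be tied together as in the following outline, in which the search and reinforce callables stand in for the search function and the human-in-the-loop reinforcement interface; all names are illustrative only.

import numpy as np
from typing import Callable

def identify_object(
    input_views: list[np.ndarray],
    search: Callable[[list[np.ndarray]], tuple[int, float]],
    confidence_threshold: float,
    reinforce: Callable[[list[np.ndarray], int], int],
) -> int:
    """Outline of the overall flow: run the search over the catalog, accept the
    final hypothesis if its confidence metric clears the threshold, and otherwise
    fall back to the human-in-the-loop reinforcement step."""
    hypothesis, conf = search(input_views)          # feature extraction, scoring, weighted majority
    if conf >= confidence_threshold:
        return hypothesis                           # accepted as the final match
    return reinforce(input_views, hypothesis)       # montage tool / operator selection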
An optional display interface 1030 may permit information from the bus 1000 to be displayed on a display device 1035 in visual, graphic or alphanumeric format. The display device may serve as a user interface of the reinforcement system 119 of FIG. 1.
The hardware may also include a user interface sensor 1045 that allows for receipt of data from user interface input devices 1050 such as a keyboard, a mouse, a joystick, a touchscreen, a touch pad, a remote control, a pointing device and/or microphone. Digital image frames also may be received from a camera 1020 (which may be a component of the image acquisition system 201 of FIG. 2).
Terminology that is relevant to this disclosure includes:
An “electronic device” or a “computing device” refers to a device or system that includes a processor and memory. Each device may have its own processor and/or memory, or the processor and/or memory may be shared with other devices as in a virtual machine or container arrangement. The memory will contain or receive programming instructions that, when executed by the processor, cause the electronic device to perform one or more operations according to the programming instructions. Examples of electronic devices include personal computers, servers, mainframes, virtual machines, containers, gaming systems, televisions, digital home assistants and mobile electronic devices such as smartphones, fitness tracking devices, wearable virtual reality devices, Internet-connected wearables such as smart watches and smart eyewear, personal digital assistants, cameras, tablet computers, laptop computers, media players and the like. Electronic devices also may include appliances and other devices that can communicate in an Internet-of-things arrangement, such as smart thermostats, refrigerators, connected light bulbs and other devices. Electronic devices also may include components of vehicles such as dashboard entertainment and navigation systems, as well as on-board vehicle diagnostic and operation systems. In a client-server arrangement, the client device and the server are electronic devices, in which the server contains instructions and/or data that the client device accesses via one or more communications links in one or more communications networks. In a virtual machine arrangement, a server may be an electronic device, and each virtual machine or container also may be considered an electronic device. In the discussion above, a client device, server device, virtual machine or container may be referred to simply as a “device” for brevity. Additional elements that may be included in electronic devices are discussed above in the context of FIG. 10.
The terms “processor” and “processing device” refer to a hardware component of an electronic device that is configured to execute programming instructions. Except where specifically stated otherwise, the singular terms “processor” and “processing device” are intended to include both single-processing device embodiments and embodiments in which multiple processing devices together or collectively perform a process.
The terms “memory,” “memory device,” “computer-readable medium,” “data storage facility,” and the like each refer to a non-transitory device on which computer-readable data, programming instructions or both are stored. The terms “catalog” and “data store” refer to a memory device together with the data that it stores. Except where specifically stated otherwise, the terms “memory,” “memory device,” “computer-readable medium,” “data store,” “data storage facility” and the like are intended to include single device embodiments, embodiments in which multiple memory devices together or collectively store a set of data or instructions, as well as individual sectors within such devices.
In this document, the terms “communication link” and “communication path” mean a wired or wireless path via which a first device sends communication signals to and/or receives communication signals from one or more other devices. Devices are “communicatively connected” if the devices are able to send and/or receive data via a communication link. “Electronic communication” refers to the transmission of data via one or more signals between two or more electronic devices, whether through a wired or wireless network, and whether directly or indirectly via one or more intermediary devices.
In this document, the term “imaging device” refers generally to a hardware sensor that is configured to acquire digital images. An imaging device may capture still and/or video images for inclusion in the image catalog described in the disclosure above. For example, an imaging device can be a camera that is held by a user, such as a DSLR (digital single lens reflex) camera, cell phone camera, or video camera. The imaging device may be part of an image capturing system that includes other hardware components. For example, an imaging device can be mounted on an accessory such as a monopod or tripod. The imaging device can also be mounted on a transporting vehicle such as an aerial drone or a robotic vehicle, or on a piloted aircraft such as a plane or helicopter having a transceiver that can send captured digital images to, and receive commands from, other components of the system.
As used in this document, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. As used in this document, the term “comprising” (or “comprises”) means “including (or includes), but not limited to.”
In this document, when terms such as “first” and “second” are used to modify a noun, such use is simply intended to distinguish one item from another, and is not intended to require a sequential order unless specifically stated. The term “approximately,” when used in connection with a numeric value, is intended to include values that are close to, but not exactly, the number. For example, in some embodiments, the term “approximately” may include values that are within +/−10 percent of the value.
The features and functions described above, as well as alternatives, may be combined into many other different systems or applications. Various alternatives, modifications, variations or improvements may be made by those skilled in the art, each of which is also intended to be encompassed by the disclosed embodiments.
This patent document claims priority to U.S. patent application No. 63/159,031, filed Mar. 10, 2021, the disclosure of which is fully incorporated into this document by reference.