This application relates to analysis of digital images. More specifically, the application relates to the identification of objects within a digital image.
In an attempt to visually identify objects or features in a given space using automated systems, digital images of the space may be captured by an image sensing device. The digital images contain information representative of a field of view of the image sensing device, including objects that exist within the field of view. In addition to visual data, such as pixels representing the form and color of an object, other information may be included in a digital image. For example, three-dimensional (3D) information, including depth information relating to objects or features represented in the digital image, may be included.
Frequently it may be useful to analyze digital images to identify objects captured in the image. While objects may often be identifiable when the digital image is viewed by a human, there are applications where it is desired that the analysis and identification of objects within an image be performed by machines. For example, convolutional neural networks (CNNs) are sometimes used to analyze visual imagery. The identification of objects using machines is complex, as a machine must learn the properties that identify an object and correlate that information to the digital representation of the object in the image. This becomes even more challenging because a given object may be viewed by the image capture device from different perspectives relative to the object. A given perspective defines the object's pose, and objects may appear substantially different depending on the object pose in combination with the angle from which the object is viewed.
Typically, object pose estimation is performed by computing an image representation and searching a pre-existing database of image representations based on known poses. The database is constructed from a training set using machine learning techniques. A popular machine learning method for obtaining the representations is through convolutional neural networks, which may be trained in an end-to-end fashion. Adding to the challenge of identifying features in an image for practical applications, depth images are often cluttered with noise and spurious background information, rendering the global image representation ineffective. This may result in an inaccurate representation of the object. Methods and systems are desired to learn image representations that are not influenced by spurious background and noise present in depth images.
A method for learning view-invariant representations in a pair of images is described. The method comprises receiving a pair of images from a pair of image capture devices, generating a plurality of candidate patches in each image of the pair of images, arranging each of the candidate patches of a first image of the pair with each of the candidate patches of a second image of the pair to create a plurality of patch pairs, identifying features in the patches of each patch pair, measuring a distance between a feature of the first patch in the patch pair and a corresponding feature of the second patch in the patch pair, comparing the distance between corresponding features in the patches of each patch pair to a threshold, and labeling the patch pair as positive or negative based on the comparison of the distance to the threshold.
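By way of illustration only, the pairing and labeling steps described above may be sketched as follows (in Python with NumPy); the function name, the example threshold of 0.05 meters, and the representation of each patch by a single 3D feature location are assumptions introduced for the sketch and are not limiting.

    import itertools
    import numpy as np

    def label_patch_pairs(patches_a, patches_b, points_a, points_b, threshold=0.05):
        """Arrange every candidate patch of a first image with every candidate patch
        of a second image and label each pair positive or negative by comparing the
        distance between the corresponding 3D feature locations to a threshold.

        patches_a, patches_b : lists of patch identifiers (e.g., box indices)
        points_a, points_b   : (N, 3) arrays of 3D locations of the patch features
        threshold            : distance threshold (assumed here to be in meters)
        """
        labeled_pairs = []
        for i, j in itertools.product(range(len(patches_a)), range(len(patches_b))):
            # Euclidean distance between the 3D feature locations of the two patches
            distance = np.linalg.norm(points_a[i] - points_b[j])
            label = 1 if distance < threshold else 0  # positive if closer than threshold
            labeled_pairs.append((patches_a[i], patches_b[j], label))
        return labeled_pairs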
According to an embodiment, each image of the pair of images is a depth image.
According to an embodiment, the method may further comprise projecting the identified features in the patches into three-dimensional space.
According to an embodiment, the method may further comprise labeling a patch pair as positive if the measured distance is less than the threshold and as negative if the measured distance is greater than the threshold.
According to an embodiment, the method may further comprise receiving intrinsic information relating to the image capture device used to capture the corresponding received image.
According to an embodiment, receiving the pair of images further comprises receiving pose information relating to the spatial position of the image capture device that captured the image.
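As a sketch of how the intrinsic and pose information might be used to project an identified feature into three-dimensional space, the following back-projects a pixel with an associated depth value into world coordinates; the pinhole camera model, the camera-to-world convention, and the variable names are assumptions for illustration only.

    import numpy as np

    def pixel_to_world(u, v, depth, K, R, t):
        """Back-project pixel (u, v) with a depth value into 3D world coordinates.

        K : (3, 3) camera intrinsic matrix
        R : (3, 3) rotation of the camera pose (camera-to-world)
        t : (3,)   translation of the camera pose (camera-to-world)
        """
        pixel = np.array([u, v, 1.0])
        point_cam = depth * (np.linalg.inv(K) @ pixel)  # point in camera coordinates
        return R @ point_cam + t                        # point in world coordinates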
According to an embodiment, the identified features are stored as a feature vector.
According to an embodiment, the method may further comprise outputting a set of labeled patch pairs, each labeled patch pair comprising a patch pair label identifying the patch pair, feature vectors associated with the patch pair, and a positive/negative label indicative of a correlation of a feature identified in the first patch of the patch pair and a feature identified in the second patch of the patch pair.
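One possible record layout for such a labeled patch pair is sketched below as a Python dataclass; the field names are illustrative only and do not limit the form of the output.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class LabeledPatchPair:
        pair_id: str            # label identifying the patch pair
        feature_a: np.ndarray   # feature vector associated with the first patch
        feature_b: np.ndarray   # feature vector associated with the second patch
        positive: bool          # True for a positive (corresponding) pair, False for a negative pair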
According to an embodiment, the plurality of candidate patches of the first image and the second image are generated by a pre-trained convolutional neural network (CNN).
According to an embodiment, the candidate patches of the first image are generated by a first CNN and the candidate patches of the second image are generated by a second CNN, the first and second CNNs being arranged in a Siamese network configuration.
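One common way to realize such a Siamese configuration is to apply a single weight-shared module to both input patches, as in the PyTorch-style sketch below; the layer sizes and the 128-dimensional feature vector are placeholders introduced for the example, not the network actually used.

    import torch.nn as nn

    class SiameseBranches(nn.Module):
        def __init__(self, feature_dim=128):
            super().__init__()
            # The same weights are used for both branches (Siamese configuration)
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feature_dim),
            )

        def forward(self, patch_a, patch_b):
            # Each single-channel depth patch is mapped to a feature vector
            return self.features(patch_a), self.features(patch_b)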
According to an embodiment, the plurality of candidate patches of the first image and the candidate patches of the second image are selected based on a likelihood that the patch contains an object of interest captured in the image.
According to an embodiment, the pair of images is captured from one given space, and the first image is captured from a first perspective and the second image is captured from a second perspective.
According to an embodiment, the method may further comprise providing a set of labeled patches to a visual analysis application.
According to an embodiment, the method may further comprise receiving an image in the visual analysis application, analyzing the received image to identify patches of interest in the received image that have a given likelihood of containing an object of interest, and comparing the patches of interest to a set of labeled patches to identify the object of interest.
According to an embodiment, the method may further comprise estimating an object pose in the received image based on the comparison to the set of labeled patches.
A system for generating view-invariant image patch representations is described. The system comprises a first image capture device and a second image capture device. A Siamese convolutional neural network, comprising a first CNN and a second CNN, is configured to receive a first image from the first image capture device and a second image from the second image capture device and to generate a plurality of candidate patches. A sampling layer is configured to receive a plurality of candidate patches from the first CNN and a plurality of candidate patches from the second CNN; the sampling layer arranges the candidate patches in pairs, compares distances between features in each patch of the pair of patches, and labels each pair of patches as positive or negative based on a comparison of the distances to a threshold. In an embodiment, the system further comprises a set of weights applied to the first CNN and the second CNN. According to another embodiment, the system comprises a visual analysis application configured to receive a set of labeled patches and an image, the visual analysis application configured to identify a pose of an object of interest in the image based on a comparison of the image to the set of labeled patches. In an embodiment, the system may further comprise a set of labeled patch pairs created by the sampling layer. Each labeled patch pair comprises a label identifying the first patch and the second patch associated with the patch pair, a first feature vector associated with the first patch, a second feature vector associated with the second patch, and a binary label associated with the pair of patches, the binary label indicative of a positive or negative correlation between the first feature vector and the second feature vector. According to aspects of another embodiment, the system further comprises a visual analysis application configured to receive the set of labeled patches and a captured image and to produce an object pose for an object in the captured image based on the set of labeled patches.
The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
To address the above challenges affecting the learning of representations in depth images, an approach for learning useful local patch representations that can be matched among images of the same object captured from different viewpoints is described herein. The local patches and their representations are generated by a deep convolutional network that is pre-trained for generating object proposals. Patches are contiguous groups of pixels comprising a subset of the pixels in the entire captured image. Patches may be selected such that the selected patches are more likely to contain features of interest in the space captured in the depth image. Throughout this description, the terms patch(es) and box(es) are used interchangeably to identify a region of a depth image, the region being a subset of the full image.
According to embodiments of this disclosure, two depth images are captured from a given space, the two depth images being captured from different perspectives. Boxes contained in the two depth images are analyzed to identify candidate regions of the images that may contain features of the captured space that are of interest for identification or further analysis. After establishing pairs between patches of the two images from different poses, analysis is performed with the goal of minimizing the distance in feature space between patches that constitute correspondences (a positive correlation) and maximizing the distance between non-corresponding patches (a negative correlation). Using two test images, local patches are generated in each test image, and a nearest neighbor search of the learned feature space is performed to find reliable matches. Once reliable matches are identified, the exact relative pose between the two images may be estimated based on the feature vectors of the corresponding patch pairs. An approach for learning view-invariant local patch representations for use in 3D pose estimation based on depth images captured using structured light sensors will now be described.
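By way of illustration, the test-time matching and pose-estimation steps outlined above may be sketched as a nearest-neighbor search in the learned feature space followed by a rigid alignment of the matched 3D patch locations; the Kabsch-style alignment shown here is one standard way to recover a relative pose and is an assumption of the sketch rather than a requirement of the disclosure.

    import numpy as np

    def match_patches(features_a, features_b):
        """Nearest-neighbor search in the learned feature space.

        features_a : (Na, D) patch feature vectors from a first test image
        features_b : (Nb, D) patch feature vectors from a second test image
        Returns, for each patch of the first image, the index of its nearest
        patch in the second image.
        """
        dists = np.linalg.norm(features_a[:, None, :] - features_b[None, :, :], axis=-1)
        return np.argmin(dists, axis=1)

    def relative_pose(points_a, points_b):
        """Estimate the rigid transform (R, t) aligning matched 3D patch locations
        using the Kabsch algorithm; points_a and points_b are (N, 3) matched sets."""
        ca, cb = points_a.mean(axis=0), points_b.mean(axis=0)
        H = (points_a - ca).T @ (points_b - cb)      # cross-covariance matrix
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                     # correct an improper reflection
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = cb - R @ ca
        return R, t                                  # R @ p_a + t approximates p_b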
The network may be further trained using a contrastive loss technique 123, which attempts to minimize the distance in the feature space between positive pairs, and maximize the distance between negative pairs. The contrastive loss function 123 further provides feedback for adjusting weights 130 used by the two branches 100, 110. Accordingly, for patches that are very close in the 3D space but sampled from different image perspectives, a representation may be learned that has a minimal distance in the feature space.
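The contrastive loss referred to above can be written, in one common form, as in the following PyTorch-style sketch; the margin value and function name are assumptions of the example.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(feat_a, feat_b, label, margin=1.0):
        """label is a float tensor with 1 for positive (corresponding) pairs and 0
        for negative pairs. Positive pairs are pulled together in feature space;
        negative pairs are pushed apart until separated by at least the margin."""
        dist = F.pairwise_distance(feat_a, feat_b)
        positive_term = label * dist.pow(2)
        negative_term = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)
        return 0.5 * (positive_term + negative_term).mean()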
According to embodiments of the present invention, patches from two input depth images are arranged in pairs and each pair is evaluated to determine if features contained in each patch of the pair correspond and represent the same feature in the 3D space. If features contained in each patch of the pair of patches are found to correspond to the same actual object, a positive label is associated with that pair of patches. If features in each patch of the pair of patches are found to correspond to dissimilar objects, the pair of patches is associated with a negative label.
Referring again to
A set of patch representation pairs and corresponding positive or negative correlation labels may be provided as an output of the sampling layer shown in
As shown in
The processors 720 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as used herein is a device for executing machine-readable instructions stored on a computer readable medium for performing tasks, and may comprise any one or combination of hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device. A processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general-purpose computer. A processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between. A user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof. A user interface comprises one or more display images enabling user interaction with a processor or other device.
Continuing with reference to
The computer system 710 also includes a disk controller 740 coupled to the system bus 721 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 741 and a removable media drive 742 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid-state drive). Storage devices may be added to the computer system 710 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
The computer system 710 may also include a display controller 765 coupled to the system bus 721 to control a display or monitor 766, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. The computer system 710 also includes an input interface 760 and one or more input devices, such as a keyboard 762 and a pointing device 761, for interacting with a computer user and providing information to the processors 720. The pointing device 761, for example, may be a mouse, a light pen, a trackball, or a pointing stick for communicating direction information and command selections to the processors 720 and for controlling cursor movement on the display 766. The display 766 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 761. In some embodiments, an augmented reality device 767 that is wearable by a user, may provide input/output functionality allowing a user to interact with both a physical and virtual world. The augmented reality device 767 is in communication with the display controller 765 and the user input interface 760 allowing a user to interact with virtual items generated in the augmented reality device 767 by the display controller 765. The user may also provide gestures that are detected by the augmented reality device 767 and transmitted to the user input interface 760 as input signals.
The computer system 710 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 720 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 730. Such instructions may be read into the system memory 730 from another computer readable medium, such as a magnetic hard disk 741 or a removable media drive 742. The magnetic hard disk 741 may contain one or more datastores and data files used by embodiments of the present invention. Datastore contents and data files may be encrypted to improve security. The processors 720 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 730. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
As stated above, the computer system 710 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 720 for execution. A computer readable medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 741 or removable media drive 742. Non-limiting examples of volatile media include dynamic memory, such as system memory 730. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 721. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
The computing environment 700 may further include the computer system 710 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 780. Remote computing device 780 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 710. When used in a networking environment, computer system 710 may include modem 772 for establishing communications over a network 771, such as the Internet. Modem 772 may be connected to system bus 721 via user network interface 770, or via another appropriate mechanism.
Network 771 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 710 and other computers (e.g., remote computing device 780). The network 771 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, Bluetooth, infrared, cellular networks, satellite, or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 771.
An executable application, as used herein, comprises code or machine-readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine-readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
A graphical user interface (GUI), as used herein, comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions. The GUI also includes an executable procedure or executable application. The executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user. The processor, under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without user direct initiation of the activity.
The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the invention to accomplish the same objectives. Although this invention has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be implemented by those skilled in the art, without departing from the scope of the invention. As described herein, the various systems, subsystems, agents, managers and processes can be implemented using hardware components, software components, and/or combinations thereof. No claim element herein is to be construed under the provisions of 35 U.S.C. 112, sixth paragraph, unless the element is expressly recited using the phrase “means for.”