This application relates to artificial intelligence. More particularly, this application relates to applying artificial intelligence to a visual recognition system.
Visual recognition systems can be used for image analytics applications such as image search and retrieval. Such applications must overcome several challenges, including image registration, 2D/3D object recognition, and pose estimation.
Computer vision systems may extract features detected in images for comparison to known features stored as feature vectors in a data library, where the set of all possible feature vectors is organized as a feature space. When enough features are matched, the object in the image can be classified according to the classification training imposed on the network. The feature space used to represent images in applications such as object recognition and pose estimation plays a critical role in generating training data for a machine learning process. Specifically, in real-world scenarios, test images captured by sensors are often cluttered with spurious background and noise, regardless of image modality. Such factors can undermine the success of an automatic system in recognizing objects or estimating their pose. For instance, a global feature representation computed over the entire image would likely incorporate these noise sources, resulting in an inaccurate feature space representation of the image. In summary, conventional training of visual recognition systems fails to automatically detect accurate features in conjunction with learning feature representations, particularly when test images include sensor noise.
Aspects according to embodiments of the present disclosure include a process and a system to automatically learn feature representations and detect keypoints in image data using an end-to-end machine learning process. The image data can be of any modality, including RGB images or depth images for example. The process proceeds in an iterative fashion, updating and refining the feature representation and keypoint detection in each round until convergence, where keypoints may be defined as points in an image best suited for object recognition analysis by a computer vision process.
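The alternating refinement described above can be sketched as a simple training loop. Everything below is illustrative: the callables, the loss-based convergence test, and the tolerance are assumptions, since the disclosure does not specify a concrete convergence criterion.

```python
def iterative_training(rendered_images, feature_net, keypoint_net,
                       train_features, train_keypoints, detect, sample_patches,
                       max_rounds=10, tol=1e-3):
    """Alternate between refining the feature representation network and the
    keypoint detector network until the feature loss stops improving.

    All callables are assumed placeholders (not names from the source):
      train_features(net, patches)      -> scalar training loss
      train_keypoints(kp_net, f_net, imgs) -> retrains the detector
      detect(kp_net, img)               -> current keypoints in `img`
      sample_patches(img, keypoints)    -> local patches around keypoints
    """
    prev_loss = float("inf")
    for _ in range(max_rounds):
        # Sample patches around the currently detected keypoints and use
        # them to refine the feature representation network.
        patches = [sample_patches(img, detect(keypoint_net, img))
                   for img in rendered_images]
        loss = train_features(feature_net, patches)
        # Regenerate keypoint score labels with the refined features,
        # then retrain the keypoint detector network.
        train_keypoints(keypoint_net, feature_net, rendered_images)
        if abs(prev_loss - loss) < tol:  # assumed convergence criterion
            break
        prev_loss = loss
    return feature_net, keypoint_net
```

In practice the convergence test could instead compare detected keypoint sets between rounds; the loss plateau above is just one plausible choice.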
In an initialization cycle, given a 3D rendering of a physical environment (e.g., a CAD rendering of a system of components which are target objects for object recognition), random 2D images of various viewpoint poses may be rendered to simulate various perspectives useful for finding candidate keypoints. The rendered images may be used to generate training data for learning viewpoint invariant feature representations from which keypoints may be generated. For example, given a test image having viewpoint pose information, a point in the image may be randomly sampled and a correspondence in a rendered image of another pose may be determined. The rendered image may be generated by perturbing the given pose of the test image. Extracting local patches around the two corresponding points yields a pair of similar patches, which may be used to train a convolutional neural network to learn a feature representation that is viewpoint invariant. Sample keypoints may be randomly selected from a test image and compared to random keypoints of reference images to generate data to train a keypoint detector network. The previously trained feature representation network may be used to process each candidate keypoint and reference keypoint to assign a score to the candidate keypoint. This score is representative of two key properties of keypoints: repeatability and uniqueness. The data generated in this fashion may be used to train the keypoint detector network. After the feature representation network and the keypoint detector network are trained in the initialization phase, an iterative refinement may be performed in subsequent cycles. Using the keypoint detector, keypoints in the images may be detected and patches may be sampled around these keypoints. These patches may be used as input to refine the feature representation network. This iterative procedure of refining the feature representation network and keypoint detector network may be repeated until convergence.
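The patch-pair generation described above can be sketched as follows. The renderer and pixel-correspondence helpers (`render`, `correspond`) are assumed placeholders, since the disclosure does not specify their implementations, and the pose perturbation model (additive Gaussian noise) is likewise an assumption.

```python
import numpy as np


def make_training_pair(image, pose, render, correspond,
                       patch_size=32, noise=0.05, rng=None):
    """Generate one pair of similar patches for the feature network.

    ASSUMED helpers (not specified in the source):
      render(pose)            -> 2D image rendered from the 3D model at `pose`
      correspond(pt, p0, p1)  -> pixel in the pose-`p1` rendering that
                                 corresponds to pixel `pt` in the pose-`p0` image
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    half = patch_size // 2
    # Randomly sample a point in the test image (away from the border).
    pt = rng.integers(half, [h - half, w - half])
    # Perturb the given pose to obtain a second viewpoint and render it.
    pose2 = pose + rng.normal(0.0, noise, size=pose.shape)
    image2 = render(pose2)
    pt2 = np.asarray(correspond(pt, pose, pose2), dtype=int)
    # Local patches around the two corresponding points form a similar pair.
    p1 = image[pt[0] - half:pt[0] + half, pt[1] - half:pt[1] + half]
    p2 = image2[pt2[0] - half:pt2[0] + half, pt2[1] - half:pt2[1] + half]
    return p1, p2
```

Such pairs would serve as positive examples when training the convolutional network to map corresponding patches from different viewpoints to nearby feature vectors.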
An advantage of iterating keypoint detection and feature representation learning in a circular manner for semi-supervised training of the machine learning networks is the efficient and automatic generation of labeled training data, without the cost and effort of manual or expert label generation, while circumventing the problems posed by noisy training images captured by the cameras commonly used in conventional applications.
Non-limiting and non-exhaustive embodiments of the present embodiments are described with reference to the following FIGURES, wherein like reference numerals refer to like elements throughout the drawings unless otherwise specified.
Methods and systems are disclosed for visual recognition of objects in images using machine learning to train a first network to detect strong localized features, called keypoints, in the presence of image noise caused by visual sensors. Conventional object recognition systems that employ keypoint matching are hindered by sensor noise and tend to identify weak keypoints that may represent a feature influenced by the noise rather than an actual physical feature of the target object. A second network may be trained by machine learning to determine viewpoint invariant feature representations of patches around the detected keypoints in a target image. The advantage of viewpoint invariance is enhanced recognition of regions in a target image that correspond to stored feature representations of those regions in an inventory library, where the stored feature representations may be of a viewpoint not identical to that of the target image. Multiple cycles of keypoint detection and feature representation generation may be applied to the first and second networks for learning refinement until convergence. Applications of the trained first and second networks of the visual recognition system include processing input images to find strong keypoints within the images, identifying the feature representations of the images based on the keypoints, and identifying objects in the images based on matching the feature representations to a library of objects indexed by the features.
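The inference flow described above, detecting keypoints, embedding patches, and matching against an indexed library, can be sketched as follows. The two trained networks are abstracted as callables, and the nearest-neighbor voting scheme is an illustrative assumption rather than a method specified by the disclosure.

```python
import numpy as np


def recognize(image, detect_keypoints, embed_patch, library,
              patch_size=32, top_k=1):
    """Identify objects in an image with the two trained networks.

    ASSUMED interfaces (names are illustrative, not from the source):
      detect_keypoints(image) -> list of (row, col) strong keypoints
      embed_patch(patch)      -> viewpoint-invariant feature vector
      library                 -> dict mapping object id to a (K, D) array
                                 of stored feature vectors for that object
    """
    half = patch_size // 2
    votes = {}
    for r, c in detect_keypoints(image):
        patch = image[r - half:r + half, c - half:c + half]
        if patch.shape[:2] != (patch_size, patch_size):
            continue  # skip keypoints too close to the image border
        feat = embed_patch(patch)
        # The library object holding the nearest stored feature casts a vote.
        best_obj, best_d = None, np.inf
        for obj, feats in library.items():
            d = np.linalg.norm(feats - feat, axis=1).min()
            if d < best_d:
                best_obj, best_d = obj, d
        votes[best_obj] = votes.get(best_obj, 0) + 1
    # Return the object(s) with the most matched keypoints.
    return sorted(votes, key=votes.get, reverse=True)[:top_k]
```

A production system would likely replace the linear scan with an approximate nearest-neighbor index over the library features; the voting logic is unchanged.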
Training data for the keypoint detector network may be generated to produce labeled keypoints. In an embodiment, random keypoints may be matched using the trained feature representation network 410. For example, given a test image with pose information, and a randomly selected test keypoint for which a score label is to be assigned, the test image may be randomly perturbed to obtain P unique viewpoint poses (e.g., P=100). Next, a set of P reference patches may be randomly sampled, one in each of the P viewpoint poses. Given a patch around the test keypoint in the test image, a test feature representation (i.e., a test feature vector) may be generated using the feature representation network 410. For example, the test keypoint patch may be fed as patch 411 to the network 410 to generate an output 421 in the form of a feature vector. Next, each of the P reference patches may be fed to the feature representation network 410 (e.g., as patch 411) to generate P corresponding feature vectors (e.g., output 421), respectively. A feature space distance between the test patch and each of the P reference patches may be computed, resulting in a P-dimensional distance vector. For example, an objective function 431 may produce a scalar value for each distance comparison and generate a distance vector from the set of distance values. The P-dimensional distance vector may be used in conjunction with a scoring scheme to identify the uniqueness and repeatability of the test keypoint in the test image. Specifically, the distribution of the distance vector may be stored as a histogram with a particular bin size (e.g., 0.1). A scoring scheme may assign a score value S, defined by the following equation:
S = 1 − (k/N)   (Equation 1)
where
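The scoring scheme above can be sketched as follows. Because the definitions following "where" are truncated in this text, the meanings of k and N are assumed purely for illustration: k is taken as the number of reference distances falling in the lowest-distance histogram bin (references confusable with the test patch), and N as the total number of reference patches P. All names are illustrative, not from the source.

```python
import numpy as np


def keypoint_score(test_feat, ref_feats, bin_size=0.1):
    """Score a candidate keypoint from feature-space distances.

    test_feat: feature vector of the patch around the test keypoint.
    ref_feats: (P, D) array of feature vectors for the P reference patches.

    The source gives S = 1 - (k/N) but truncates the definitions of k and N;
    the choices below (k = count in the smallest-distance bin, N = P) are
    ASSUMPTIONS for this sketch.
    """
    # P-dimensional distance vector between the test patch and references.
    dists = np.linalg.norm(ref_feats - test_feat, axis=1)
    # Histogram of the distance distribution with a fixed bin size (e.g., 0.1).
    edges = np.arange(0.0, max(dists.max(), bin_size) + bin_size, bin_size)
    hist, _ = np.histogram(dists, bins=edges)
    k = hist[0]      # assumed: references confusable with the test patch
    N = len(dists)   # assumed: total number of reference patches P
    return 1.0 - k / N
```

Under this reading, a keypoint whose patch is far in feature space from almost all randomly sampled reference patches (a unique, repeatable keypoint) receives a score near 1, while a confusable keypoint scores lower.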
The processors 820 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as described herein is a device for executing machine-readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device. A processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general purpose computer. A processor may include any type of suitable processing unit including, but not limited to, a central processing unit, a microprocessor, a Reduced Instruction Set Computer (RISC) microprocessor, a Complex Instruction Set Computer (CISC) microprocessor, a microcontroller, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a System-on-a-Chip (SoC), a digital signal processor (DSP), and so forth. Further, the processor(s) 820 may have any suitable microarchitecture design that includes any number of constituent components such as, for example, registers, multiplexers, arithmetic logic units, cache controllers for controlling read/write operations to cache memory, branch predictors, or the like. The microarchitecture design of the processor may be capable of supporting any of a variety of instruction sets. A processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between. 
A user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof. A user interface comprises one or more display images enabling user interaction with a processor or other device.
The system bus 821 may include at least one of a system bus, a memory bus, an address bus, or a message bus, and may permit exchange of information (e.g., data (including computer-executable code), signaling, etc.) between various components of the computer system 810. The system bus 821 may include, without limitation, a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, and so forth. The system bus 821 may be associated with any suitable bus architecture including, without limitation, an Industry Standard Architecture (ISA), a Micro Channel Architecture (MCA), an Enhanced ISA (EISA), a Video Electronics Standards Association (VESA) architecture, an Accelerated Graphics Port (AGP) architecture, a Peripheral Component Interconnects (PCI) architecture, a PCI-Express architecture, a Personal Computer Memory Card International Association (PCMCIA) architecture, a Universal Serial Bus (USB) architecture, and so forth.
Continuing with reference to
The operating system 834 may be loaded into the memory 830 and may provide an interface between other application software executing on the computer system 810 and hardware resources of the computer system 810. More specifically, the operating system 834 may include a set of computer-executable instructions for managing hardware resources of the computer system 810 and for providing common services to other application programs (e.g., managing memory allocation among various application programs). In certain example embodiments, the operating system 834 may control execution of one or more of the program modules depicted as being stored in the data storage 840. The operating system 834 may include any operating system now known or which may be developed in the future including, but not limited to, any server operating system, any mainframe operating system, or any other proprietary or non-proprietary operating system.
The application programs 835 may include a set of computer-executable instructions for performing the iterative keypoint and viewpoint invariant feature learning for the visual recognition process in accordance with embodiments of the disclosure.
The computer system 810 may also include a disk/media controller 843 coupled to the system bus 821 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 841 and/or a removable media drive 842 (e.g., floppy disk drive, compact disc drive, tape drive, flash drive, and/or solid state drive). Storage devices 840 may be added to the computer system 810 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire). Storage devices 841, 842 may be external to the computer system 810, and may be used to store image processing data in accordance with the embodiments of the disclosure.
The computer system 810 may also include a display controller 865 coupled to the system bus 821 to control a display or monitor 866, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. The computer system includes a user input interface 860 and one or more input devices, such as a user terminal 861, which may include a keyboard, touchscreen, tablet and/or a pointing device, for interacting with a computer user and providing information to the processors 820. The display 866 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the user terminal device 861.
The computer system 810 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 820 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 830. Such instructions may be read into the system memory 830 from another computer readable medium, such as the magnetic hard disk 841 or the removable media drive 842. The magnetic hard disk 841 may contain one or more data stores and data files used by embodiments of the present invention. The data stores may include, but are not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed data stores in which data is stored on more than one node of a computer network, peer-to-peer network data stores, or the like. The processors 820 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 830. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
As stated above, the computer system 810 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 820 for execution. A computer readable medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 841 or removable media drive 842. Non-limiting examples of volatile media include dynamic memory, such as system memory 830. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 821. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer readable medium instructions.
The computing environment 800 may further include the computer system 810 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 880. The network interface 870 may enable communication, for example, with other remote devices 880 or systems and/or the storage devices 841, 842 via the network 871. Remote computing device 880 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 810. When used in a networking environment, computer system 810 may include modem 872 for establishing communications over a network 871, such as the Internet. Modem 872 may be connected to system bus 821 via user network interface 870, or via another appropriate mechanism.
Network 871 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 810 and other computers (e.g., remote computing device 880). The network 871 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, Bluetooth, infrared, cellular networks, satellite, or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 871.
It should be appreciated that the program modules, applications, computer-executable instructions, code, or the like depicted in
An executable application, as used herein, comprises code or machine readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without user direct initiation of the activity.
The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the invention to accomplish the same objectives. Although this invention has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be implemented by those skilled in the art, without departing from the scope of the invention. As described herein, the various systems, subsystems, agents, managers and processes can be implemented using hardware components, software components, and/or combinations thereof. No claim element herein is to be construed under the provisions of 35 U.S.C. 112(f), unless the element is expressly recited using the phrase “means for.”
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2018/040668 | 7/3/2018 | WO | 00
Number | Date | Country
---|---|---
62528672 | Jul 2017 | US