The present invention generally relates to the field of image processing. More particularly, the present invention relates to image processing using high-level image information.
Understanding the meaning and content of images remains one of the most challenging problems in machine intelligence and statistical learning. In contrast to inference tasks in other domains, such as natural language processing (NLP), the basic feature space in which visual data lie rarely bears explicit, humanly perceivable meaning. In NLP, each dimension of a document embedding space could correspond to a word or a topic; common representations of visual data, by contrast, build primarily on raw physical measurements of the pixels, such as color and intensity, on their mathematical transformations, such as various filters, or on simple image statistics, such as shape and edge orientations. Depending on the specific visual inference task, such as classification, a predictive method is deployed to pool together and model the statistics of the image features and to use them to build a hypothesis for the predictor.
Robust low-level image features have been effective representations for a variety of visual recognition tasks such as object recognition and scene classification, but pixels, or even local image patches, carry little semantic meanings. For high-level visual tasks, such low-level image representations may not be satisfactory.
Much work has been performed in the area of image classification and feature identification in images. For example, toward identifying features in an image, significant work has been performed on low-level features. Because digital images are collections of pixels, much of that work addresses how a collection of many pixels conveys visual information. It is, therefore, a goal of such methods to take low-level information and generate higher-level information about the image. Indeed, some of the results generated by low-level analysis can be difficult for a human viewer to perceive in an image, for example, a radiographic image containing very small spiculations that may be indicative of a cancerous tumor.
But it can also be desirable to identify higher-level information about an image of the kind that is readily apparent to a lay viewer. For example, a viewer can readily identify everyday objects in a photograph, such as people, houses, and animals. Moreover, a viewer can readily identify context in an image, for example, a sporting event, an activity, or a task. It can, therefore, be desirable to identify high-level features in an image that would be appreciated by viewers so that, for example, images may be retrieved in response to a query.
Recognizing and analyzing certain high-level information in images can be difficult for prior art low-level algorithms. The present invention takes a different approach. Rather than relying strictly on low-level information, the present invention makes use of high-level information from a collection of images. Among other things, the present invention uses many object detectors at different image locations and scales to represent features in images.
The present invention generally relates to understanding the meaning and content of images. More particularly, the present invention relates to a method for the representation of images based on known objects. The present invention uses a collection of object sensing filters to classify scenes in an image or to provide information on semantic features of the image. The present invention provides useful results in performing high-level visual recognition tasks in cluttered scenes. Among other things, the present invention is able to provide this information by making use of known datasets of images.
An embodiment of the present invention generates an Object Bank that is an image representation constructed from the response of multiple object detectors. For example, an object detector could detect the presence of “blobby” objects such as tables, cars, humans, etc. Alternatively, an object detector can be a texture classifier optimized for detecting sky, road, sand, etc. In this way, the Object Bank contains generalized high-level information, e.g., semantic information, about objects in images.
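The pooling of multiple detector responses into one Object Bank feature vector can be sketched as follows. This is a minimal illustration, not the exact implementation: each detector is assumed to be a callable mapping an image to a 2-D response map (a hypothetical interface, since the representation is agnostic to the detector type), and only the maximum response per detector and scale is kept.

```python
import numpy as np

def object_bank_feature(image, detectors, scales):
    """Stack the max response of each detector at each scale into one vector.

    `detectors` is a list of callables mapping a 2-D image array to a
    2-D response map -- an assumed interface for illustration only.
    """
    features = []
    for det in detectors:
        for s in scales:
            h, w = int(image.shape[0] * s), int(image.shape[1] * s)
            # nearest-neighbour resize keeps the sketch dependency-free
            rows = (np.arange(h) / s).astype(int).clip(0, image.shape[0] - 1)
            cols = (np.arange(w) / s).astype(int).clip(0, image.shape[1] - 1)
            resized = image[np.ix_(rows, cols)]
            response = det(resized)          # 2-D map of detector scores
            features.append(response.max())  # pool to a single statistic
    return np.asarray(features)
```

The resulting vector has one entry per (detector, scale) pair; richer poolings (e.g., per spatial-pyramid cell) extend this same loop.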
In an embodiment, a collection of images from a complex dataset are used to train the classification algorithm of the present invention. Thereafter, an image having unknown content is input. The algorithm of the present invention then provides classification information about the scene in the image. For example, the algorithm of the present invention can be trained with images of sporting activities so as to identify the types of activities, e.g., skiing, snowboarding, rock climbing, etc., shown in an image.
Results from the present invention indicate that, in certain recognition tasks, it performs better than certain low-level feature extraction algorithms. In particular, the present invention provides better results in classification tasks that may have similar low-level information but different high-level information. For example, certain low-level prior art algorithms may struggle to distinguish a bedroom image from a living room image because much of the low-level information, e.g., texture, is similar in both types of images. The present invention, however, can make use of certain high-level information about the objects in the image, e.g., bed or table, and their arrangement to distinguish between the two scenes.
In an embodiment, the present invention makes use of a high-level image representation where an image is represented as a scale-invariant response map of a large number of pre-trained object detectors, blind to the testing dataset or visual task. Using the Object Bank representation, improved performance on high-level visual recognition tasks can be achieved with off-the-shelf classifiers such as logistic regression and linear SVM.
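As a rough illustration of applying such an off-the-shelf classifier to Object Bank features, the following is a minimal, self-contained logistic-regression sketch in numpy, standing in for library implementations; the synthetic data, learning rate, and regularization strength are placeholders, not values from the experiments described herein.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, steps=500, l2=1e-3):
    """Plain l2-regularized logistic regression via gradient descent --
    a stand-in for the off-the-shelf classifiers named in the text."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid scores
        grad_w = X.T @ (p - y) / len(y) + l2 * w  # mean gradient + l2 term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Threshold the linear score at zero for a binary decision."""
    return (X @ w + b > 0).astype(int)
```

In practice, any linear classifier (e.g., a linear SVM) slots into the same position, taking the OB feature vectors as input.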
The following drawings will be used to more fully describe embodiments of the present invention.
Among other things, the present disclosure relates to methods, techniques, and algorithms that are intended to be implemented in a digital computer system 100 such as generally shown in
Computer system 100 may include at least one central processing unit 102 and may include many processors or processing cores. Computer system 100 may further include memory 104 in different forms such as RAM, ROM, hard disks, optical drives, and removable drives that may further include drive controllers and other hardware. Auxiliary storage 112 may also be included; it can be similar to memory 104 but may be more remotely incorporated, such as in a distributed computer system with distributed memory capabilities.
Computer system 100 may further include at least one output device 108 such as a display unit, video hardware, or other peripherals (e.g., printer). At least one input device 106 may also be included in computer system 100 that may include a pointing device (e.g., mouse), a text input device (e.g., keyboard), or touch screen.
Communications interfaces 114 also form an important aspect of computer system 100 especially where computer system 100 is deployed as a distributed computer system. Computer interfaces 114 may include LAN network adapters, WAN network adapters, wireless interfaces, Bluetooth interfaces, modems and other networking interfaces as currently available and as may be developed in the future.
Computer system 100 may further include other components 116 that may be generally available components as well as specially developed components for implementation of the present invention. Importantly, computer system 100 incorporates various data buses 116 that are intended to allow for communication of the various components of computer system 100. Data buses 116 include, for example, input/output buses and bus controllers.
Indeed, the present invention is not limited to computer system 100 as known at the time of the invention. Instead, the present invention is intended to be deployed in future computer systems with more advanced technology that can make use of all aspects of the present invention. It is expected that computer technology will continue to advance but one of ordinary skill in the art will be able to take the present disclosure and implement the described teachings on the more advanced computers as they become available. Moreover, the present invention may be implemented on one or more distributed computers. Still further, the present invention may be implemented in various types of software languages including C, C++, and others. Also, one of ordinary skill in the art is familiar with compiling software source code into executable software that may be stored in various forms and in various media (e.g., magnetic, optical, solid state, etc.). One of ordinary skill in the art is familiar with the use of computers and software languages and, with an understanding of the present disclosure, will be able to implement the present teachings for use on a wide variety of computers.
The present disclosure provides a detailed explanation of the present invention with detailed formulas and explanations that allow one of ordinary skill in the art to implement the present invention in a computer learning method. For example, the present disclosure provides detailed indexing schemes that readily lend themselves to multi-dimensional arrays for storing and manipulating data in a computerized implementation. Certain of these and other details are not included in the present disclosure so as not to detract from the teachings presented herein, but it is understood that one of ordinary skill in the art would be familiar with such details.
Turning now more particularly to image processing, conventional image and scene classification has been done at low levels such as generally shown in
While more sophisticated low-level feature engineering and recognition model design remain important sources of future development, the use of a semantically more meaningful feature space, such as one based directly on the content (e.g., objects) of the images, much as words serve for textual documents, can offer another avenue for empowering a computational visual recognizer to handle arbitrary natural images. This is especially true in the current era, in which visual knowledge of millions of common objects is readily available from various sources on the Internet.
Rather than making use of only low-level features, the present invention makes use of high-level features (e.g., objects in an image) to better classify images. Shown in
The Object Bank (also called “OB”) of the present invention makes use of a representation of natural images based on objects, or more rigorously, a collection of object sensing filters built on a generic collection of labeled objects.
The present invention provides an image representation based on objects that is useful in high-level visual recognition tasks for scenes cluttered with objects. The present invention provides complementary information to that of the low-level features.
While the OB representation of the present invention offers a rich, high-level description of images, a key technical challenge of this representation is the “curse of dimensionality,” which is severe because of the size (i.e., the number of objects) of the object bank and the dimensionality of the response vector for each object. Typically, for a modestly sized picture, even hundreds of object detectors can result in a representation of tens of thousands of dimensions. Therefore, to achieve a robust predictor on a practical dataset with typically only dozens or a few hundred instances per class, structural risk minimization via appropriate regularization of the predictive model is important. In an embodiment, the present invention can be implemented with or without compression.
The present invention provides an Object Bank that is an image representation constructed from the responses of many object detectors, which can be viewed as the response of a “generalized object convolution.” In an embodiment, two types of detectors are used for this operation. More particularly, a latent SVM object detector and a texture classifier are used. One of ordinary skill will, however, recognize that other detectors can be used without deviating from the teachings of the present invention. The latent SVM object detectors are useful for detecting blobby objects such as tables, cars, and humans, among other things. The texture classifier is useful for more texture- and material-based objects such as sky, road, and sand, among other things.
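The “generalized object convolution” can be illustrated by sliding a detector window over the image and recording a score at every location. The correlation score below is a toy stand-in for a real latent-SVM or texture detector; only the sliding-window structure is the point of the sketch.

```python
import numpy as np

def response_map(image, template):
    """Slide a detector 'template' over the image and record a score at
    every location, producing the dense response map described above.
    A raw correlation score stands in for a real detector's score."""
    th, tw = template.shape
    H, W = image.shape
    out = np.zeros((H - th + 1, W - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + th, j:j + tw]
            out[i, j] = float((patch * template).sum())
    return out
```

Running one such map per detector, per scale, yields the stack of response maps from which the OB feature vector is pooled.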
As used in the present disclosure, “object” is used in its most general form to include, for example, things such as cars and dogs but also other things such as sky and water. Also, the image representation of the present invention is generally agnostic to any specific type of object detector.
Certain object names as may be used in the Object Bank of the present invention are shown in
The image processing algorithm of the present invention, therefore, introduces a shift in the manner of processing images. Whereas conventional image processing operates at low levels (e.g., pixel level), the present invention operates at a higher level (e.g., object level). Shown in
Finally, a selected number of Object Bank responses 818 are shown with varying levels of response for the different images 802 and 804. As illustrated in
Given the availability of large-scale image datasets such as LabelMe and ImageNet, trained object detectors can be obtained for a large number of visual concepts. In fact, as databases grow and computational power improves, thousands, if not millions, of object detectors can be developed for use in accordance with the present invention.
In an embodiment, 200 object detectors are used at 12 detection scales and 3 spatial pyramid levels (L=0,1,2). This is a general representation that can be applicable to many images and tasks. The same set of object detectors can be used for many scenes and datasets. In other embodiments, the number of object detectors is in the range from 100 to 300. In still other embodiments, images are scaled in the range from 5 to 20 times. In still other embodiments, up to 10 spatial pyramid levels are used.
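Under the assumption that each detector contributes one pooled statistic per scale and per spatial-pyramid cell (level L contributing 4^L cells, so levels 0, 1, 2 give 1 + 4 + 16 = 21 cells), the resulting feature dimensionality can be computed as follows. The per-cell pooling scheme is an assumption for illustration, not a detail stated above.

```python
def ob_dimension(num_detectors=200, num_scales=12, pyramid_levels=3):
    """Feature dimensionality assuming one pooled statistic per
    detector, scale, and spatial-pyramid cell."""
    cells = sum(4 ** level for level in range(pyramid_levels))  # 1 + 4 + 16 = 21
    return num_detectors * num_scales * cells
```

With the embodiment's parameters this yields 200 × 12 × 21 = 50,400 dimensions, consistent with the “tens of thousands of dimensions” noted above as motivating regularization.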
Many or substantially all types of objects can be used in the Object Bank of the present invention. Indeed, as the detectors continue to become more robust, especially with the emergence of large-scale datasets such as LabelMe and ImageNet, use of substantially all types of objects becomes more feasible.
But computational intensity and computation time, among other things, can limit the types of objects to use. For example, the use of all the objects in the LabelMe dataset may be computationally intensive and presently infeasible. As computational power and computational techniques improve, however, larger datasets may be used in accordance with the present invention.
As shown in graph 902,
In an embodiment, a few hundred of the most useful (or popular) objects in images were used. A practical consideration is ensuring the availability of enough training images for each object detector. Such an embodiment, therefore, focuses on obtaining the objects from popular image datasets such as ESP, LabelMe, ImageNet, and the Flickr online photo-sharing community, for example.
After ranking the objects according to their frequencies in each of these datasets, an embodiment of the present invention takes the intersection set of the most frequent 1000 objects, resulting in 200 objects, where the identities and semantic relations of some of them are as shown with reference to
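The ranking-and-intersection step described above can be sketched with Python's standard library. The label lists and the `top_k` value below are illustrative placeholders, not the actual dataset contents.

```python
from collections import Counter

def frequent_object_intersection(dataset_label_lists, top_k=1000):
    """Rank object labels by frequency within each dataset, then
    intersect the per-dataset top-k sets, as described above."""
    top_sets = []
    for labels in dataset_label_lists:
        counts = Counter(labels)
        top = {name for name, _ in counts.most_common(top_k)}
        top_sets.append(top)
    return set.intersection(*top_sets)
```

Applied with top_k = 1000 across the datasets named above, this selection step yields the 200-object vocabulary used by the embodiment.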
To train each of the 200 object detectors, 100-200 images and their object bounding box information were used from the LabelMe (86 objects) and ImageNet datasets (177 objects). A subset of the LabelMe scene dataset was used to evaluate the object detector performance. Final object detectors are selected based on their performance on the validation set from LabelMe. Shown in
The OB representation was evaluated and shown to have improved results on four scene datasets, ranging from generic natural scene images (15-Scene, LabelMe 9-class scene dataset) to cluttered indoor images (MIT Indoor Scene) and complex event and activity images (UIUC-Sports). From 100 popular scene names, nine classes were obtained from the LabelMe dataset in which there are more than 100 images: beach, mountain, bathroom, church, garage, office, sail, street, and forest. The maximum number of images in those classes is 1000.
Scene classification performance was evaluated by average multi-way classification accuracy over all scene classes in each dataset. Below is a list of the various experiment settings for each dataset:
OB was compared in scene classification tasks with different types of conventional image features, such as SIFT-BoW, GIST, and SPM.
A conventional SVM classifier and a customized implementation of the logistic regression (LR) classifier were used on all feature representations being compared. The behaviors of different structural risk minimization schemes over LR on the OB representation were investigated. The following logistic regressions were analyzed: l1-regularized LR (LR1), l1/l2-regularized LR (LRG), and l1/l2+l1-regularized LR (LRG1).
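For concreteness, the three regularization terms can be written out numerically. The grouping of weights below (one group per object detector's block of weights) is illustrative; the exact grouping and regularization strengths used in the experiments are not restated here.

```python
import numpy as np

def penalties(W, groups, lam1=1.0, lam_g=1.0):
    """The three structural-risk terms compared in the text, for a
    weight matrix W whose rows are grouped (illustratively) by object.

    LR1  : lam1 * sum |w|            (l1: sparse individual weights)
    LRG  : lam_g * sum_g ||W_g||_2   (l1/l2: whole groups zeroed out)
    LRG1 : LRG + LR1                 (sparse groups and sparse weights)
    """
    l1 = lam1 * np.abs(W).sum()
    group = lam_g * sum(np.linalg.norm(W[g]) for g in groups)
    return {"LR1": l1, "LRG": group, "LRG1": group + l1}
```

The l1/l2 (group) term is what lets the learner discard entire object detectors at once, while the added l1 term in LRG1 further sparsifies the weights within the surviving groups.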
The implementation details are as follows:
Also shown in
Improved performance was shown on three out of four datasets (
The classification performance of using the detected object location and its detection score of each object detector as the image representation was also evaluated. The classification performance of this representation is 62.0%, 48.3%, 25.1% and 54% on the 15 scene, LabelMe, UIUC-Sports and MIT-Indoor datasets respectively.
The spatial structure and semantic meaning encoded in the OB of the present invention were further evaluated by using a “pseudo” OB (
The reported state-of-the-art performances were compared to the OB algorithm (using a standard LR classifier) as shown in Table 1 for each of the existing scene datasets (UIUC-Sports, 15-Scene, and MIT-Indoor). Other algorithms use more complex models and supervised information, whereas the results from the present invention are obtained by applying a relatively simple logistic regression.
OB is constructed from the responses of many object detectors, which encode the semantic and spatial information of objects within images. It can naturally be applied to object recognition tasks.
The object recognition performance on the Caltech 256 dataset was compared to a high-level image representation obtained as the output of a large number of weakly trained object classifiers on the image. By encoding the spatial locations of the objects within an image, OB (39%) significantly outperforms the weakly trained object classifiers (36%) on the 256-way classification task, where performance is measured as the average of the diagonal values of a 256×256 confusion matrix.
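The performance measure described here, the average of the diagonal values of the confusion matrix, can be computed as follows; the 2×2 matrix used in testing is an illustrative stand-in for the 256×256 case.

```python
import numpy as np

def mean_diagonal_accuracy(conf):
    """Average per-class accuracy: normalize each row of the confusion
    matrix to sum to 1, then average the diagonal entries."""
    conf = np.asarray(conf, dtype=float)
    row_sums = conf.sum(axis=1, keepdims=True)
    normalized = conf / np.where(row_sums == 0, 1, row_sums)  # guard empty rows
    return float(np.mean(np.diag(normalized)))
```

Averaging the per-class accuracies, rather than pooling all test images, keeps rare classes from being swamped by common ones.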
It should be appreciated by those skilled in the art that the specific embodiments disclosed above may be readily utilized as a basis for modifying or designing other image processing systems and methods. It should also be appreciated by those skilled in the art that such modifications do not depart from the scope of the invention as set forth in the appended claims.
The current application is a continuation of U.S. patent application Ser. No. 15/004,831 entitled “Method for Implementing a High-Level Image Representation for Image Analysis” to Li et al., filed Jan. 22, 2016, which is a continuation of U.S. patent application Ser. No. 12/960,467 entitled “Method for Implementing a High-Level Image Representation for Image Analysis” to Li et al., filed Feb. 22, 2011. The disclosures of U.S. patent application Ser. No. 15/004,831 and Ser. No. 12/960,467 are hereby incorporated by reference in their entirety.
This invention was made with Government support under contract 1000845 awarded by the National Science Foundation. The Government has certain rights in this invention.
Relation | Number | Date | Country
---|---|---|---
Parent | 15004831 | Jan 2016 | US
Child | 15289037 | | US
Parent | 12960467 | Feb 2011 | US
Child | 15004831 | | US