In the field of image processing, visual attention refers to a process where some parts of an image receive more attention from the human brain and visual system. Many applications (i.e., computer software applications) such as automatic image cropping, adaptive image display, and image/video compression employ visual attention. Most existing visual attention approaches are based on a bottom-up computational framework that involves extraction of multiple low-level visual features in an image, such as intensity, contrast, and motion. These approaches may be effective in finding a few fixation locations in images, but they have not been able to accurately detect the actual region of visual attention.
This summary is provided to introduce simplified concepts of salient object detection, described below in the Detailed Description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.
In an embodiment, a method is performed for salient object detection where an image is received; the image includes the salient object and a background. The salient object may be defined using various feature maps which are combined. The salient object may be detected using the combined features.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit of a component reference number identifies the particular figure in which the component first appears.
FIGS. 6a and 6b show exemplary creations of center-surround histogram feature maps.
FIG. 7a shows an exemplary creation of a color spatial variance feature map.
FIG. 7b shows an exemplary performance evaluation curve for a spatial variance feature map.
Systems and methods for detecting a salient object for studying visual attention in an image are described. In one implementation, the methods separate a distinctive foreground image, or salient object, from the image background. To this end, the systems and methods identify the salient object in the image by describing the salient object locally, regionally, and globally using a set of visual features.
In an implementation, the visual features include multi-scale contrast, center-surround histogram, and color spatial distribution. A conditional random field learning approach is used to combine these features to detect the salient object present in the image. A large image database containing numerous images carefully labeled by multiple users is created for training the conditional random field. These and other aspects of detecting a salient object in an image are now described in greater detail.
Salient Object Detection
The systems and methods for detecting a salient object in an image are described in the general context of computer-executable instructions being executed by a computing device, such as a personal computer. Computer instructions generally include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. While the systems and methods are described in the foregoing context, acts and operations described hereinafter can be implemented in hardware or other forms of computing platforms.
At block 104, the salient object in the image is defined using various feature maps. In an embodiment, three different feature maps may be used to define the salient object locally, regionally, and globally. For example, the local features of the salient object can be defined by a multi-scale contrast feature map, where image contrast is computed at multiple scales. The regional features of the salient object can be defined by a center-surround histogram feature map, where histogram distances with different locations and sizes are computed. The global feature of the salient object can be defined by a color-spatial variance feature map, where the global spatial distribution of a specific color is used to describe the saliency of the object.
At block 106, the various feature maps are combined using a learned conditional random field (CRF). In CRF learning, an optimal linear combination of the feature maps can be obtained under the maximum likelihood (ML) criterion.
At block 108, the salient object in the image is detected using the learned CRF. The detected salient object can then be used in various applications related to visual attention.
Exemplary Method for Conditional Random Field Learning
The detection of a salient object can be formulated as a binary labeling problem in which the salient object is separated from the image background. A salient object can be represented as a binary mask $A = \{a_x\}$ in an image $I$. For each pixel $x$, $a_x \in \{1, 0\}$ is a binary label that indicates whether or not the pixel $x$ belongs to the salient object.
At block 204, a saliency probability map, $G = \{g_x \mid g_x \in [0, 1]\}$, of the salient object can be computed for each labeled image as:
where $M$ is the number of users involved in labeling, and $A^m = \{a_x^m\}$ is the binary mask labeled by the $m$th user.
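The computation at block 204 can be illustrated with a short sketch. It assumes that the saliency probability at a pixel is the fraction of the $M$ users who marked that pixel as salient, which is one reading consistent with the definitions of $M$ and $A^m$ above. The function name and the sample masks are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = salient, 0 = background


def saliency_probability_map(masks: List[Mask]) -> List[List[float]]:
    """Average M binary user masks into per-pixel probabilities in [0, 1].

    Assumption: g_x is the mean of a_x^m over the M users.
    """
    if not masks:
        raise ValueError("at least one user mask is required")
    m = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [sum(mask[y][x] for mask in masks) / m for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical users label the same 2x3 image.
user_masks = [
    [[0, 1, 1], [0, 1, 0]],
    [[0, 1, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 1, 0]],
]
g = saliency_probability_map(user_masks)
```

A pixel marked by every user receives probability 1, and a pixel marked by no user receives probability 0.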
At block 206, statistics for measuring labeling consistency are computed. The labeling of the images done by the multiple users may be consistent or inconsistent. Consistently labeled images are those in which a majority of the users identify a common object as the salient object. Inconsistently labeled images are those in which the multiple users may identify different objects as the salient object. Examples of consistent and inconsistent labeling are shown in
where $C_t$ is the percentage of pixels whose saliency probabilities are above a given threshold $t$. For example, $C_{0.5}$ is the percentage of the pixels agreed on by at least half of the users. $C_{0.9} \approx 1$ indicates that the image is consistently labeled by all the users.
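As an illustration of the consistency statistic, the sketch below assumes that $C_t$ is the share of the total saliency probability mass carried by pixels whose probability exceeds $t$. This normalization is an assumption, chosen so that $C_{0.9}$ approaches 1 when all users agree. The function name and sample maps are hypothetical.

```python
from typing import List


def consistency(g: List[List[float]], t: float) -> float:
    """Share of the saliency mass on pixels with probability above t.

    Assumed form: sum of g_x over {x : g_x > t}, divided by the sum of all g_x.
    """
    values = [p for row in g for p in row]
    total = sum(values)
    if total == 0:
        return 0.0  # nobody labeled anything; treat as fully inconsistent
    return sum(p for p in values if p > t) / total


# All users agree on the right column.
g_agree = [[0.0, 1.0], [0.0, 1.0]]
# Users split over the top row and agree only on the bottom-right pixel.
g_split = [[0.5, 0.5], [0.0, 1.0]]

c_agree = consistency(g_agree, 0.9)
c_split = consistency(g_split, 0.9)
```

Under this reading, the fully agreed map scores 1.0 and the split map scores 0.5 at $t = 0.9$.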
At block 208, a set of consistent images is selected from the image set AI based on the consistency statistics. For example, the consistent images may be the images for which $C_{0.9} > 0.8$. These consistent images form an image set BI, which can be used for training conditional random fields to detect salient objects. For this, a saliency probability map $G$, computed for a detected salient object mask $A$, can be used to define or describe region-based and boundary-based measurements. For region-based measurement, Precision, Recall, and F-measure can be used to determine the accuracy of salient object detection based on a labeled salient object.
Precision indicates the ratio of the correctly detected salient region to the detected salient region, and Recall indicates the ratio of the correctly detected salient region to the labeled salient region. They can be determined as follows:
F-measure indicates the weighted harmonic mean of precision and recall, with a non-negative constant α and can be determined as follows:
In one implementation, the F-measure can be used as an overall performance measurement of labeling consistency, and α may be set as 0.5. For the boundary-based measurement, techniques for boundary displacement error (BDE) measurement known in the art are used to measure the average displacement error of corresponding boundaries of two rectangles that identify the salient object in the image set BI. The displacement of boundaries is averaged over the multiple users.
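The region-based measurements can be sketched as follows. The sketch assumes that the overlap between a detected mask $A$ and a probability map $G$ is the sum of $g_x a_x$, that Precision divides the overlap by the detected area, that Recall divides it by the labeled mass, and that the F-measure takes the form $(1+\alpha) \cdot P \cdot R / (\alpha P + R)$, which is one common weighted harmonic mean. All names and sample values are hypothetical.

```python
from typing import List, Tuple


def region_scores(
    g: List[List[float]], a: List[List[int]], alpha: float = 0.5
) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) for mask a against map g."""
    overlap = sum(gv * av for g_row, a_row in zip(g, a) for gv, av in zip(g_row, a_row))
    detected = sum(av for a_row in a for av in a_row)
    labeled = sum(gv for g_row in g for gv in g_row)
    precision = overlap / detected if detected else 0.0
    recall = overlap / labeled if labeled else 0.0
    denom = alpha * precision + recall
    # Weighted harmonic mean of precision and recall (assumed form).
    f_measure = (1.0 + alpha) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


g_map = [[0.0, 1.0, 1.0], [0.0, 0.5, 0.0]]
detected_mask = [[0, 1, 1], [0, 0, 0]]
precision, recall, f_measure = region_scores(g_map, detected_mask, alpha=0.5)
```

Here the detection covers only pixels that every user labeled, so Precision is 1.0, while Recall is 0.8 because the half-agreed pixel is missed.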
At block 210, different types of feature maps are determined for consistent images. In an implementation, three different types of feature maps are created for defining local, regional, and global features of a salient object. A multi-scale contrast feature map can be created for defining local features of the salient object, a center-surround histogram feature map can be created for defining regional features of the salient object, and a color spatial variance feature map can be created for defining global features of the salient object.
At block 212, conditional random fields are trained using the feature maps. In the Conditional Random Field or CRF model, the probability of the label $A = \{a_x\}$ in the image $I$ is modeled as a conditional distribution
where $Z$ is the partition function. To detect a salient object, the energy $E(A \mid I)$ is defined as a linear combination of $K$ salient features $F_k(a_x, I)$ and a pairwise feature $S(a_x, a_{x'}, I)$. In one implementation, the energy $E(A \mid I)$ is defined as follows:
where $\lambda_k$ is the weight of the $k$th feature, and $x$, $x'$ are two adjacent pixels.
In equation 6 above, the salient object feature $F_k(a_x, I)$ indicates whether or not a pixel $x$ belongs to the salient object. Each kind of salient object feature provides a normalized feature map $f_k(x, I) \in [0, 1]$ for every pixel. The salient object feature $F_k(a_x, I)$ can be determined as:
Furthermore, in equation 6, the pairwise feature $S(a_x, a_{x'}, I)$ models the spatial relationship between two adjacent pixels. $S(a_x, a_{x'}, I)$ can be determined as:
$$S(a_x, a_{x'}, I) = |a_x - a_{x'}| \cdot \exp(-\beta\, d_{x,x'}) \qquad (8)$$
where $d_{x,x'} = \|I_x - I_{x'}\|$ is the L2 norm of the color difference. $\beta$ is a robust parameter that weights the color contrast and can be set as $\beta = (2\langle \|I_x - I_{x'}\|^2 \rangle)^{-1}$ in one implementation, where $\langle \cdot \rangle$ is the expectation operator. This feature function can be considered to be a penalty term when adjacent pixels are assigned different labels. The more similar the colors of the two pixels are, the less likely they are to be assigned different labels. With this pairwise feature for segmentation, the homogeneous interior region inside the salient object can also be labeled as salient pixels.
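The energy described above can be sketched for a small image. The pairwise term follows equation 8 and the stated choice of $\beta$. The unary term is an assumption: it charges $f_k(x, I)$ when a pixel is labeled background and $1 - f_k(x, I)$ when it is labeled salient. The 4-connected neighborhood and all function names are also assumptions made for illustration.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Color = Tuple[float, float, float]
Pixel = Tuple[int, int]


def color_distance(c1: Color, c2: Color) -> float:
    """L2 norm of the color difference, d_{x,x'}."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(c1, c2)))


def adjacent_pairs(height: int, width: int) -> Iterator[Tuple[Pixel, Pixel]]:
    """Yield each horizontal and vertical neighbor pair once (assumed 4-connectivity)."""
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                yield (y, x), (y, x + 1)
            if y + 1 < height:
                yield (y, x), (y + 1, x)


def estimate_beta(image: List[List[Color]]) -> float:
    """beta = 1 / (2 * mean squared color difference over adjacent pairs)."""
    sq = [
        color_distance(image[y1][x1], image[y2][x2]) ** 2
        for (y1, x1), (y2, x2) in adjacent_pairs(len(image), len(image[0]))
    ]
    mean_sq = sum(sq) / len(sq) if sq else 0.0
    return 1.0 / (2.0 * mean_sq) if mean_sq > 0 else 0.0


def unary(label: int, f: float) -> float:
    """Assumed unary cost: f for background (0), 1 - f for salient (1)."""
    return 1.0 - f if label == 1 else f


def energy(
    labels: List[List[int]],
    image: List[List[Color]],
    feature_maps: Sequence[List[List[float]]],
    weights: Sequence[float],
) -> float:
    """Weighted unary costs plus the pairwise penalty of equation 8."""
    height, width = len(labels), len(labels[0])
    beta = estimate_beta(image)
    total = 0.0
    for y in range(height):
        for x in range(width):
            for lam, fmap in zip(weights, feature_maps):
                total += lam * unary(labels[y][x], fmap[y][x])
    for (y1, x1), (y2, x2) in adjacent_pairs(height, width):
        d = color_distance(image[y1][x1], image[y2][x2])
        total += abs(labels[y1][x1] - labels[y2][x2]) * math.exp(-beta * d)
    return total


black, white = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
image = [[black, white], [black, white]]
feature = [[0.1, 0.9], [0.1, 0.9]]  # hypothetical normalized feature map
e_match = energy([[0, 1], [0, 1]], image, [feature], [1.0])
e_flipped = energy([[1, 0], [1, 0]], image, [feature], [1.0])
e_all_background = energy([[0, 0], [0, 0]], image, [feature], [1.0])
```

A labeling that agrees with the feature map has lower energy than its inverse, and a uniform labeling incurs no pairwise penalty.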
According to an embodiment, in CRF learning the linear weights $\vec{\lambda} = \{\lambda_k\}_{k=1}^{K}$ are estimated under the maximum likelihood (ML) criterion to get an optimal linear combination of object features in the feature maps. For any $N$ training image pairs $\{I^n, A^n\}_{n=1}^{N}$, the optimal parameters maximize the sum of the log-likelihood as follows:
The derivative of the log-likelihood with respect to the parameter λk is the difference between two expectations:
Then, the gradient descent direction is given by:
where,
is the marginal distribution and $p(a_x^n \mid g_x^n)$ is from the following labeled ground-truth:
In one embodiment, a pseudo-marginal belief computed by belief propagation can be used to approximate the marginal distribution. For this, a tree-reweighted belief propagation can be run in a gradient descent to compute an approximation of the marginal distribution.
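The weight update can be sketched as follows, under several stated assumptions. The per-pixel model marginals are taken as given inputs, for example pseudo-marginals from a belief propagation routine that is not shown. The ground-truth probability of the salient label is assumed to equal $g_x$. The unary feature is assumed to cost $f_k$ for background and $1 - f_k$ for salient. With a model of the form $\exp(-E)/Z$, the derivative of the log-likelihood with respect to $\lambda_k$ is then the expected feature under the model minus the expected feature under the ground truth. All names are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[float]]


def expected_unary(q_salient: float, f: float) -> float:
    """Expected unary feature when P(a_x = 1) = q_salient (assumed feature form)."""
    return (1.0 - q_salient) * f + q_salient * (1.0 - f)


def log_likelihood_gradient(
    feature_maps: Sequence[Grid], model_marginals: Grid, ground_truth: Grid
) -> List[float]:
    """Per-feature gradient: model expectation minus ground-truth expectation."""
    grads = []
    for fmap in feature_maps:
        total = 0.0
        for f_row, q_row, g_row in zip(fmap, model_marginals, ground_truth):
            for f, q, g in zip(f_row, q_row, g_row):
                total += expected_unary(q, f) - expected_unary(g, f)
        grads.append(total)
    return grads


def update_weights(weights: Sequence[float], grads: Sequence[float], step: float = 0.1) -> List[float]:
    """One ascent step on the log-likelihood (the step size is a hypothetical choice)."""
    return [lam + step * grad for lam, grad in zip(weights, grads)]


informative = [[0.1, 0.9]]    # agrees with the ground truth
uninformative = [[0.5, 0.5]]  # carries no information
ground_truth = [[0.0, 1.0]]
model_marginals = [[0.5, 0.5]]  # an undecided model

grads = log_likelihood_gradient([informative, uninformative], model_marginals, ground_truth)
new_weights = update_weights([1.0, 1.0], grads)
```

In this example the informative feature receives a positive gradient and gains weight, while the uninformative feature is left unchanged.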
Exemplary Salient Object Detection and Labeling
Images 410 are examples of inconsistent images or images in which the salient object may not be consistently labeled by different users. The inconsistency in labeling may be due to multiple disjoint foreground objects in an image. For example, in image 412a, one user may consider the plant 414 as the salient object while labeling, whereas another user may consider the dog 416 as the salient object. Similarly, in image 412b, a user may consider the flower 418 and bee 420 jointly as the salient object, while another may consider the bee 420 alone as the salient object. The inconsistent images may not be considered for conditional random field or CRF learning, as described above with respect to
Salient Object Feature Maps
where $I^l$ is the $l$th level image in the pyramid and $L$ is the number of pyramid levels. In one implementation, $L$ is set to 6 and $N(x)$ is a 9×9 window. The feature map $f_c(\cdot, I)$ can be normalized to a fixed range [0, 1].
For an input image 502, multiple contrast maps can be generated at multiple scales, as shown in image 504. Image 506 shows the contrast feature map obtained from linearly combining the various contrast maps at multiple scales.
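A multi-scale contrast map of this kind can be sketched on a small grayscale image. The sketch assumes that contrast at each level is the sum of squared intensity differences between a pixel and its neighbors in the window $N(x)$, that the pyramid is built by 2×2 block averaging, and that levels are combined by nearest-neighbor upsampling and summation. These choices and all names are assumptions made for illustration.

```python
from typing import List

Gray = List[List[float]]


def downsample(img: Gray) -> Gray:
    """Halve the resolution by averaging 2x2 blocks (stand-in pyramid filter)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def local_contrast(img: Gray, radius: int) -> Gray:
    """Sum of squared differences to every neighbor in a (2*radius+1)^2 window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    out[y][x] += (img[y][x] - img[ny][nx]) ** 2
    return out


def multi_scale_contrast(img: Gray, levels: int = 6, radius: int = 4) -> Gray:
    """Sum per-level contrast maps at full resolution, then normalize to [0, 1]."""
    h, w = len(img), len(img[0])
    total = [[0.0] * w for _ in range(h)]
    level_img = img
    for _ in range(levels):
        lh, lw = len(level_img), len(level_img[0])
        contrast = local_contrast(level_img, radius)
        for y in range(h):
            for x in range(w):
                # Nearest-neighbor lookup into the coarser level.
                total[y][x] += contrast[y * lh // h][x * lw // w]
        if lh < 2 or lw < 2:
            break  # the image is too small for another pyramid level
        level_img = downsample(level_img)
    lo = min(min(row) for row in total)
    hi = max(max(row) for row in total)
    if hi == lo:
        return [[0.0] * w for _ in range(h)]
    return [[(v - lo) / (hi - lo) for v in row] for row in total]


# 8x8 dark image with a bright 2x2 square in the middle.
image = [[1.0 if 3 <= y <= 4 and 3 <= x <= 4 else 0.0 for x in range(8)] for y in range(8)]
contrast_map = multi_scale_contrast(image, levels=6, radius=4)
```

The defaults mirror the values given above (six levels and a 9×9 window, i.e. a radius of 4). The bright square receives the highest response and the far corners the lowest.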
FIGS. 6a and 6b show an exemplary creation of a center-surround histogram feature map. A salient object can have a larger extent than local contrast and can thus be distinguished from its surrounding context. For this, regional salient features of an image can be defined with the use of a center-surround histogram. For example, for a labeled salient object enclosed by a rectangle $R$, a surrounding contour $R_S$ with the same area as $R$ is constructed, as shown in image 602 in
Histograms can be used for global description of appearance of the image, since histograms are insensitive to small changes in size, shape, and viewpoint. The histogram of a rectangle with a location and size can be computed by means of integral histogram known in the art. For example,
In an implementation, varying aspect ratios of the object can be handled using five templates with different aspect ratios {0.5, 0.75, 1.0, 1.5, 2.0}. The most distinct rectangle R*(x) centered at each pixel x can be found by varying the size and aspect ratio:
In an implementation, the size range of the rectangle $R(x)$ is set to $[0.1, 0.7] \times \min(w, h)$, where $w$ is the image width and $h$ is the image height. Then, the center-surround histogram feature $f_h(x, I)$ is defined as a sum of spatially weighted distances:
where $R^*(x')$ is the rectangle centered at $x'$ and containing the pixel $x$. The weight $w_{xx'} = \exp(-0.5\,\sigma_{x'}^{-2}\,|x - x'|^2)$ is a Gaussian falloff weight with variance $\sigma_{x'}^2$, which is set to one-third of the size of $R^*(x')$. The feature map $f_h(\cdot, I)$ is then normalized to the range [0, 1].
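The center-surround comparison for a single rectangle can be sketched as follows. The sketch assumes a chi-square distance between normalized intensity histograms, a surrounding ring of fixed margin rather than one of exactly equal area, and plain per-pixel counting rather than integral histograms. These simplifications and all names are assumptions. The Gaussian falloff weight follows the expression given above.

```python
import math
from typing import List, Tuple

Gray = List[List[float]]           # intensities in [0, 1]
Rect = Tuple[int, int, int, int]   # (top, left, height, width)


def histogram(img: Gray, rect: Rect, bins: int) -> List[float]:
    """Count quantized intensities inside rect, clipped to the image."""
    top, left, h, w = rect
    counts = [0.0] * bins
    for y in range(max(0, top), min(len(img), top + h)):
        for x in range(max(0, left), min(len(img[0]), left + w)):
            counts[min(int(img[y][x] * bins), bins - 1)] += 1.0
    return counts


def normalize(counts: List[float]) -> List[float]:
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def chi_square(h1: List[float], h2: List[float]) -> float:
    """Chi-square distance between two normalized histograms (assumed metric)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def center_surround_distance(img: Gray, center: Rect, margin: int, bins: int = 8) -> float:
    """Distance between the histogram of center and that of the ring around it."""
    top, left, h, w = center
    outer = (top - margin, left - margin, h + 2 * margin, w + 2 * margin)
    inner_counts = histogram(img, center, bins)
    outer_counts = histogram(img, outer, bins)
    ring_counts = [o - i for o, i in zip(outer_counts, inner_counts)]
    return chi_square(normalize(inner_counts), normalize(ring_counts))


def gaussian_weight(dist_sq: float, sigma_sq: float) -> float:
    """Falloff weight exp(-0.5 * |x - x'|^2 / sigma^2)."""
    return math.exp(-0.5 * dist_sq / sigma_sq)


# 6x6 dark image with a bright 2x2 object at rows 2-3, columns 2-3.
image = [[1.0 if 2 <= y <= 3 and 2 <= x <= 3 else 0.0 for x in range(6)] for y in range(6)]
on_object = center_surround_distance(image, (2, 2, 2, 2), margin=1)
on_background = center_surround_distance(image, (0, 0, 2, 2), margin=1)
```

A rectangle that tightly encloses the object differs maximally from its surround, while a rectangle on the background differs only slightly.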
FIG. 6b shows an exemplary center-surround feature map computed for an input image. Image 606 is the input image containing the salient object, and image 608 is the corresponding center-surround histogram feature map. Thus, the salient object can be highlighted by the center-surround histogram feature.
FIG. 7a shows an exemplary illustration 700 of the creation of a color spatial variance feature map. In an image, the salient object may not contain a color that is widely distributed in the image. The global spatial distribution of a specific color can therefore be used to describe the saliency of an object in the image. The spatial distribution of a specific color can be described by computing the spatial variance of the color. In one implementation, all colors in the image are represented by a Gaussian Mixture Model (GMM) $\{w_c, \mu_c, \Sigma_c\}_{c=1}^{C}$, where $w_c$, $\mu_c$, and $\Sigma_c$ are respectively the weight, the mean color, and the covariance matrix of the $c$th component. Each pixel can be assigned to a color component with the probability:
Then, the horizontal variance $V_h(c)$ of the spatial position for each color component $c$ can be determined as:
where $x_h$ is the x-coordinate of the pixel $x$, and $|X|_c = \sum_x p(c \mid I_x)$. The vertical variance $V_v(c)$ can be determined in a similar manner. The spatial variance of a component $c$ is defined as $V(c) = V_h(c) + V_v(c)$. In an implementation, $\{V(c)\}_c$ is normalized to the range [0, 1]: $V(c) \leftarrow (V(c) - \min_c V(c)) / (\max_c V(c) - \min_c V(c))$. The color spatial-distribution feature $f_s(x, I)$ can be defined as a weighted sum:
The feature map $f_s(\cdot, I)$ is also normalized to the range [0, 1].
The spatial variance of the color at the image corners or boundaries may be small if the image is cropped from the whole scene. To reduce this artifact, a center-weighted, spatial-variance feature can be determined as:
where $D(c) = \sum_x p(c \mid I_x)\, d_x$ is a weight that assigns less importance to colors near image boundaries; it is also normalized to [0, 1], similar to $V(c)$. $d_x$ is the distance from pixel $x$ to the image center.
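The color spatial-distribution feature can be sketched as follows. The per-pixel component probabilities $p(c \mid I_x)$ are taken as given inputs, since fitting the GMM is not shown. The sketch assumes that the weighted sum takes the form $\sum_c p(c \mid I_x)(1 - V(c))$, and that the center-weighted variant multiplies each term by $(1 - D(c))$. Both forms and all names are assumptions.

```python
import math
from typing import List

Resp = List[List[List[float]]]  # resp[y][x][c] = p(c | I_x)


def min_max(values: List[float]) -> List[float]:
    lo, hi = min(values), max(values)
    return [0.0] * len(values) if hi == lo else [(v - lo) / (hi - lo) for v in values]


def spatial_variances(resp: Resp, num_components: int) -> List[float]:
    """V(c) = V_h(c) + V_v(c), with positions weighted by p(c | I_x)."""
    out = []
    for c in range(num_components):
        cells = [(y, x, p[c]) for y, row in enumerate(resp) for x, p in enumerate(row)]
        mass = sum(p for _, _, p in cells)
        if mass == 0:
            out.append(0.0)
            continue
        mean_y = sum(p * y for y, _, p in cells) / mass
        mean_x = sum(p * x for _, x, p in cells) / mass
        var_h = sum(p * (x - mean_x) ** 2 for _, x, p in cells) / mass
        var_v = sum(p * (y - mean_y) ** 2 for y, _, p in cells) / mass
        out.append(var_h + var_v)
    return out


def center_distances(resp: Resp, num_components: int) -> List[float]:
    """D(c) = sum over pixels of p(c | I_x) times the distance to the image center."""
    cy, cx = (len(resp) - 1) / 2.0, (len(resp[0]) - 1) / 2.0
    return [
        sum(p[c] * math.hypot(y - cy, x - cx)
            for y, row in enumerate(resp) for x, p in enumerate(row))
        for c in range(num_components)
    ]


def color_spatial_feature(resp: Resp, num_components: int, center_weighted: bool = False) -> List[List[float]]:
    """Assumed form: sum_c p(c | I_x) * (1 - V(c)) [* (1 - D(c))], normalized to [0, 1]."""
    v = min_max(spatial_variances(resp, num_components))
    d = min_max(center_distances(resp, num_components)) if center_weighted else [0.0] * num_components
    raw = [
        [sum(p[c] * (1.0 - v[c]) * (1.0 - d[c]) for c in range(num_components)) for p in row]
        for row in resp
    ]
    width = len(raw[0])
    flat = min_max([val for row in raw for val in row])
    return [flat[i * width:(i + 1) * width] for i in range(len(raw))]


# 3x3 image: component 1 occupies only the center pixel, component 0 everything else.
resp = [[[0.0, 1.0] if (y, x) == (1, 1) else [1.0, 0.0] for x in range(3)] for y in range(3)]
plain = color_spatial_feature(resp, 2)
weighted = color_spatial_feature(resp, 2, center_weighted=True)
```

The compact center color receives the highest score, and the widely spread color receives the lowest, in both variants.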
In
FIG. 7b shows an exemplary performance evaluation plot 708 for a spatial variance feature map. The effectiveness of the color spatial variance feature on an image set, such as the image set AI, can be determined by plotting the color spatial variance on the x-axis against the average saliency probability on the y-axis. The plot for image set AI is shown in
Conditional Random Field Learning Using Feature Map Combinations
Conditional random fields may be learned using different combinations of feature maps, as discussed above with respect to
Graph 802 shows evaluation of image set AI and graph 804 shows evaluation of image set BI. The horizontal axis is marked with numbers 1, 2, 3 and 4, where 1 refers to salient object detection by CRF learned from multi-scale contrast feature map, 2 refers to salient object detection by CRF learned from center-surround histogram feature map, 3 refers to salient object detection by CRF learned from color spatial distribution feature map, and 4 refers to CRF learned from a combination of all the three features.
In this example, as seen from graphs 802 and 804, the multi-scale contrast feature has a high precision but a very low recall. This may be because the inner homogeneous region of a salient object has low contrast. The center-surround histogram has the best overall performance (on F-measure) among all individual features. This may be because the regional feature is able to detect the whole salient object, although the background region may contain some errors. The color spatial distribution has slightly lower precision but has the highest recall. Furthermore, in this example, after linearly combining all three features by CRF learning, the CRF with all three features is found to produce the best overall result, as shown in the last bars in
Multiple Salient Object Detection
Exemplary Procedure
At block 1002, an image is received that contains a salient object or a distinctive foreground object. The receiving may be performed as part of a software application on a computing device.
At block 1004, the image is rescaled to a standard size. In an implementation, the image is resized so that the larger of its width and height is 400 pixels; this standard size can be used to set parameters while creating feature maps.
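The target dimensions for the rescaling at block 1004 can be computed as in the following sketch. The function name and the rounding rule are assumptions; only the 400-pixel target for the larger dimension comes from the description.

```python
from typing import Tuple


def rescaled_size(width: int, height: int, target: int = 400) -> Tuple[int, int]:
    """Scale (width, height) so the larger dimension equals target pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    # Round to whole pixels and keep at least one pixel on the short side.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = rescaled_size(800, 600)
portrait = rescaled_size(300, 1200)
```

The aspect ratio is preserved up to rounding, so an 800×600 image maps to 400×300 and a 300×1200 image maps to 100×400.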
At block 1006, local features of the image are defined with the use of a multi-scale contrast feature map. The high contrast boundaries of the image are highlighted as explained with reference to
At block 1008, regional features of the image are defined with the use of a center-surround histogram feature map, as explained with reference to
At block 1010, global features of the image are defined with the use of a color spatial distribution feature map, as described with reference to
At block 1012, the salient object in the image is detected by learned CRF, as described with reference to
The above procedure for implementation has been described with respect to one embodiment of the system. It can be appreciated that the process can be implemented by other embodiments as well.
Exemplary Computing Environment
Computer environment 1100 includes a general-purpose computing-based device in the form of a computer 1102. Computer 1102 can be, for example, a desktop computer, a handheld computer, a notebook or laptop computer, a server computer, a game console, and so on. The components of computer 1102 can include, but are not limited to, one or more processors or processing units 1104, a system memory 1106, and a system bus 1108 that couples various system components including the processor 1104 to the system memory 1106.
The system bus 1108 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures can include an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus, also known as a Mezzanine bus.
Computer 1102 typically includes a variety of computer readable media. Such media can be any available media that is accessible by computer 1102 and includes both volatile and non-volatile media, removable and non-removable media.
The system memory 1106 includes computer readable media in the form of volatile memory, such as random access memory (RAM) 1110, and/or non-volatile memory, such as read only memory (ROM) 1112. A basic input/output system (BIOS) 1114, containing the basic routines that help to transfer information between elements within computer 1102, such as during start-up, is stored in ROM 1112. RAM 1110 typically contains data and/or program modules that are immediately accessible to and/or presently operated on by the processing unit 1104.
Computer 1102 may also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example,
The disk drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, data structures, program modules, and other data for computer 1102. Although the example illustrates a hard disk 1116, a removable magnetic disk 1120, and a removable optical disk 1124, it is to be appreciated that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like, can also be utilized to implement the exemplary computing system and environment.
Any number of program modules can be stored on the hard disk 1116, magnetic disk 1120, optical disk 1124, ROM 1112, and/or RAM 1110, including by way of example, an operating system 1127, one or more application programs 1128, other program modules 1130, and program data 1132. Each of such operating system 1127, one or more application programs 1128, other program modules 1130, and program data 1132 (or some combination thereof) may implement all or part of the resident components that support the distributed file system.
A user can enter commands and information into computer 1102 via input devices such as a keyboard 1134 and a pointing device 1136 (e.g., a “mouse”). Other input devices 1138 (not shown specifically) may include a microphone, joystick, game pad, satellite dish, serial port, scanner, and/or the like. These and other input devices are connected to the processing unit 1104 via input/output interfaces 1140 that are coupled to the system bus 1108, but may be connected by other interface and bus structures, such as a parallel port, game port, or a universal serial bus (USB).
A monitor 1142 or other type of display device can also be connected to the system bus 1108 via an interface, such as a video adapter 1144. In addition to the monitor 1142, other output peripheral devices can include components such as speakers (not shown) and a printer 1146 which can be connected to computer 1102 via the input/output interfaces 1140.
Computer 1102 can operate in a networked environment using logical connections to one or more remote computers, such as a remote computing-based device 1148. By way of example, the remote computing-based device 1148 can be a personal computer, portable computer, a server, a router, a network computer, a peer device or other common network node, and the like. The remote computing-based device 1148 is illustrated as a portable computer that can include many or all of the elements and features described herein relative to computer 1102.
Logical connections between computer 1102 and the remote computer 1148 are depicted as a local area network (LAN) 1150 and a general wide area network (WAN) 1152. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
When implemented in a LAN networking environment, the computer 1102 is connected to a local network 1150 via a network interface or adapter 1154. When implemented in a WAN networking environment, the computer 1102 typically includes a modem 1156 or other means for establishing communications over the wide network 1152. The modem 1156, which can be internal or external to computer 1102, can be connected to the system bus 1108 via the input/output interfaces 1140 or other appropriate mechanisms. It is to be appreciated that the illustrated network connections are exemplary and that other means of establishing communication link(s) between the computers 1102 and 1148 can be employed.
In a networked environment, such as that illustrated with computing environment 1100, program modules depicted relative to the computer 1102, or portions thereof, may be stored in a remote memory storage device. By way of example, remote application programs 1158 reside on a memory device of remote computer 1148. For purposes of illustration, application programs and other executable program components such as the operating system are illustrated herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computing-based device 1102, and are executed by the data processor(s) of the computer.
Various modules and techniques may be described herein in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
An implementation of these modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can be any available media that can be accessed by a computer. By way of example, and not limitation, computer readable media may comprise computer storage media and communications media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
Alternatively, portions of the framework may be implemented in hardware or a combination of hardware, software, and/or firmware. For example, one or more application specific integrated circuits (ASICs) or programmable logic devices (PLDs) could be designed or programmed to implement one or more portions of the framework.
The above-described systems and methods describe salient object detection. Although the systems and methods have been described in language specific to structural features and/or methodological operations or actions, it is understood that the implementations defined in the appended claims are not necessarily limited to the specific features or actions described. Rather, the specific features and operations are disclosed as exemplary forms of implementing the claimed subject matter.