Content-based visualization and user-modeling for interactive browsing and retrieval in multimedia databases

Information

  • Patent Grant
  • 6792434
  • Patent Number
    6,792,434
  • Date Filed
    Friday, April 20, 2001
  • Date Issued
    Tuesday, September 14, 2004
Abstract
A method for visualizing multimedia objects assigns a feature vector to each multimedia object. The feature vector of each multimedia object is reduced to a location vector having a dimensionality of a display device. A cost function is evaluated to determine an optimal location vector for each multimedia object, and each multimedia object is displayed on a display device according to the optimal location vector. The reducing can use principal component analysis. In addition, a relevance score can be determined for each displayed multimedia object, and the multimedia objects can then be visually enhanced according to the relevance score.
Description




FIELD OF THE INVENTION




This invention relates generally to computer-based systems which provide access to multimedia databases, and more particularly to systems that visualize multimedia objects according to media characteristics.




BACKGROUND OF THE INVENTION




Traditional browsing and navigating in a large multimedia database, for example, image, video, or audio databases, is often disorienting unless a user can form a mental picture of the entire database. Content-based visualization can provide an efficient approach for browsing and navigating multimedia databases.




MEDIA FEATURES




Many browsing and retrieval systems are feature based. Typical features are color, texture, and structure for images; color and motion for videos; and cepstrum, pitch, zero crossing rate, and temporal trajectories for audio. Color is one of the most widely used features for content-based image/video analysis. It is relatively robust to background complication and independent of image size and orientation. Color histograms are the most commonly used color feature representation. While histograms are useful because they are relatively insensitive to position and orientation changes, they do not capture the spatial relationship of color regions, and thus, color histograms have limited discriminating power.




One can also use color moments. There, the color distribution of an image is interpreted as a probability distribution, and the color distribution can be uniquely characterized by its moments. Characterizing a 1-D color distribution with the first three moments of color is more robust and more efficient than working with color histograms.




Texture refers to the visual pattern with properties of homogeneity that do not result from the presence of a single color or intensity. Texture contains important information about the arrangement of surfaces and the relationship of the surfaces to the surrounding environment. Texture can be represented by wavelets: an image is fed into a wavelet filter bank, which decomposes the image into wavelet levels, each having a number of bands. Each band captures the features of some scale and orientation of the original image. For each band, the standard deviation of the wavelet coefficients can be extracted.




Structure is a more general feature than texture and shape. Structure captures information such as rough object size, structural complexity, loops in edges, etc. Structure does not require a uniform texture region, nor a closed shape contour. Edge-based structure features can be extracted by a so-called “water-filling algorithm,” see X. Zhou, Y. Rui and T. S. Huang, “Water-filling algorithm: A novel way for image feature extraction based on edge maps,” in Proc. IEEE Intl. Conf. On Image Proc., Japan, 1999, and X. S. Zhou and T. S. Huang, “Edge-based structural feature for content-based image retrieval,” Pattern Recognition Letters, Vol 22/5, April 2001, pp. 457-468.




SUMMARY OF THE INVENTION




The invention visualizes multimedia objects, such as multiple images, on an output device based on media features such as color, texture, structure, audio cepstrum, textual semantics, or any combination thereof. The visualization can use the actual objects, or visual icons representing the objects. The resulting arrangement of multimedia objects automatically clusters objects having similar features. An original high-dimensional feature space is reduced to display space, i.e., locations having coordinates x and y, by principal component analysis (PCA).




Furthermore, the invention provides a process that optimizes the display by maximizing visibility, while minimizing deviation from the original locations of the objects. Given the original PCA-based visualization, the constrained non-linear optimization process adjusts the location and size of the multimedia objects in order to minimize overlap while maintaining fidelity to the original locations of the objects, which are indicative of mutual similarities. Furthermore, the appearance of specific objects in the display can be enhanced using a relevance score.




More particularly, the invention provides a method for visualizing image objects. The method assigns a feature vector to each image. The feature vector of each image is reduced to a location vector having a dimensionality of a display device. A cost function is evaluated to determine an optimal location vector for each image, and each image is displayed on a display device according to the optimal location vector. The reducing can use principal component analysis.











BRIEF DESCRIPTION OF THE DRAWINGS





FIG. 1 is a block diagram of a method for visualizing multimedia objects according to the invention;

FIG. 2 is a graph of a cost function of overlap; and

FIG. 3 is a graph of a cost function of deviation.











DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT




As shown in FIG. 1, our invention provides a method and system 100 for optimally visualizing multimedia objects 101, e.g., digital images, videos, or audio programs, on a two or three dimensional output device 150. Our visualization is based on features 111 extracted from or associated with the multimedia objects 101. The features can include visual, audio, and semantic features, depending on the content of the multimedia. Our invention augments a user's perception of a multitude of multimedia objects in order to provide information that cannot be perceived by traditional browsing methods, where objects are usually displayed either randomly, or as a regular pattern of tiles.




We first extract features from the digital multimedia objects 101 stored in a multimedia database 102 using multimedia analysis 110 to produce a feature vector 111 for each object 101. For image objects, we use the HSV color space due to its de-correlated and uniform coordinates. A nine-dimensional color feature vector, i.e., three moments for each color channel, is extracted from every image in the database. Texture and structure features are respectively represented by ten wavelet moments and eighteen water-filling features. As stated above, each feature vector can also include audio features, semantic features, or any combination of these features.
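The nine-dimensional color feature can be illustrated with a minimal sketch. It assumes the three moments are the mean, the standard deviation, and the signed cube root of the third central moment of each HSV channel, which is a common convention but is not spelled out in the text; the function name `color_moments` is hypothetical. Hue is treated as an ordinary number here, although it is really an angle.

```python
import colorsys
import math


def color_moments(rgb_pixels):
    """Return 9 values: (mean, std, skew) for each of the H, S and V channels.

    rgb_pixels is a non-empty list of (r, g, b) tuples with components in [0, 1].
    The choice of moments is an assumption, not taken from the patent text.
    """
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb_pixels]
    n = len(hsv)
    features = []
    for channel in range(3):
        values = [pixel[channel] for pixel in hsv]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Signed cube root keeps the third moment in the same units as the mean.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        features.extend([mean, math.sqrt(variance), skew])
    return features
```

Concatenating this vector with the texture and structure features would give one feature vector 111 per image.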




Consequently, the dimensionality of the feature vectors 111 for the multimedia can be very large, e.g., hundreds or more. Therefore, we use principal component analysis (PCA) 120 to reduce the dimensionality of the feature vectors 111. PCA is a very fast linear transformation that achieves the maximum distance preservation when projecting from high to low dimensional feature spaces. We project the feature vectors 111 to either 2D or 3D space depending on the display device 150. This projection is applied to each feature vector 111 to produce a corresponding two or three dimensional location 121 for each multimedia object. For example, each location 121 has (x, y), (x, y, z), (ρ, θ) or (ρ, θ, φ) coordinates. The mutual distance between object locations respects their feature similarity, thus clustering displayed objects or icons representing the objects 151 according to their features, see U.S. patent application Ser. No. 09/651,002 “Multi-User Interactive Picture Presentation System and Method,” filed by Shen et al. on Aug. 29, 2000, incorporated herein by reference.
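The PCA projection described above can be sketched in plain Python. This is an illustrative assumption about one way to compute it (power iteration with deflation on the covariance matrix), not the method the text prescribes; `pca_project` and its parameters are hypothetical names.

```python
import math


def pca_project(vectors, out_dim=2, iterations=200):
    """Project feature vectors onto their top `out_dim` principal axes."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    centered = [[v[k] - mean[k] for k in range(d)] for v in vectors]
    # Covariance matrix of the centered data (d x d).
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(d)]
           for a in range(d)]
    axes = []
    for _axis in range(out_dim):
        # Fixed, non-uniform start vector; a start that happens to be
        # orthogonal to the wanted axis would need a different seed.
        w = [1.0 + k for k in range(d)]
        for _step in range(iterations):
            w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
            # Deflation: remove the components along axes already found.
            for axis in axes:
                dot = sum(x * y for x, y in zip(w, axis))
                w = [x - dot * y for x, y in zip(w, axis)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                # No variance left in the remaining directions.
                w = [0.0] * d
                break
            w = [x / norm for x in w]
        axes.append(w)
    # Coordinates of each object along the principal axes.
    return [[sum(row[k] * axis[k] for k in range(d)) for axis in axes]
            for row in centered]
```

Calling `pca_project(feature_vectors, out_dim=2)` returns one (x, y) pair per object; `out_dim=3` would serve a three dimensional display.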




However, if many of the objects 101 are similar in terms of their features, then it is expected that there will be a considerable amount of overlap in the visualized objects or icons because the values of the coordinates of the reduced dimensional vectors 121 will be close to each other. This will make it difficult for the user to discern all of the objects at once. It is desired to maximize visibility. This can be done by moving the objects apart, and reducing the size of the objects or icons. However, we want to maximize visibility while minimizing deviation from the original layout, and at the same time not make the objects so small that they become indiscernible.




Optimizing Lay-Out of Display




Therefore, we provide a technique 130 that optimally displays the objects 101. We optimize 130 the location and sizes of the objects 101 using a non-linear cost function J. A number of factors are taken into consideration while minimizing the cost function J. The visibility (minimum overlap) among objects on the display device 150 is made as large as possible. The total deviation from the original location 121 on the display device 150 is at the same time made as small as possible. Furthermore, the location of each object must be on the display device, and the size of each object must be greater than a minimum size threshold. To maximize the total visibility, the objects are moved away from each other.




However, this increases the deviation of the objects from the original locations. Large deviations from the original location are undesirable when the original location of each object is important. Without increasing the total deviation, the object size can be reduced. Of course, the object size cannot be arbitrarily small. Because increasing the object size will increase the amount of overlap, the initial object size is assumed to be the maximum size.




Therefore, our cost function J uses the following parameters. The number of objects that are optimized is N. The original location 121 of each object i is denoted, e.g., for a 2D display, by $\{x_i^0, y_i^0\}$, for $i = 1, \ldots, N$. Optimized locations 131 of the objects are denoted by $\{x_i, y_i\}$, for $i = 1, \ldots, N$. The maximum and minimum coordinates of the display device are $[x_{\min}, x_{\max}, y_{\min}, y_{\max}]$. For simplicity, the radius of each object is $r_i$, for $i = 1, \ldots, N$. The maximum and minimum object sizes, in terms of radius, are $r_{\max}$ and $r_{\min}$. The original object size is $r_i = r_{\max}$, for $i = 1, \ldots, N$.




The cost function J that optimizes the visualization is a linear combination of two individual cost functions that take into account the factors mentioned above.








$$J = F(p) + \lambda \cdot S \cdot G(p) \tag{1}$$






where F(p) is a cost function of total visibility, and G(p) is a cost function of the total deviation from the original location of the objects. In order to maximize the total visibility, the total amount of overlap of the objects is minimized. The value S is a scaling factor which brings the range of the cost function G(p) to the same range as the cost function F(p), and λ is a weight, i.e., λ≥0. When λ is 0, the deviation of the objects is not considered in visibility maximization. When λ is 1, the cost functions of visibility and deviation are equally weighted. When 0<λ<1, the maximization of visibility is more important than the minimization of deviation, and vice versa for λ>1.




The cost function F(p) of total visibility is











$$F(p) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} f(p), \tag{2}$$

where

$$f(p) = \begin{cases} 1 - e^{-u^{2}/\sigma_f}, & u > 0 \\ 0, & u \le 0, \end{cases}$$

and where

$$u = r_i + r_j - \sqrt{(x_i - x_j)^{2} + (y_i - y_j)^{2}}. \tag{3}$$













When $u \le 0$, there is no overlap between the jth object and the ith object, and the cost is 0. When $u > 0$, there is partial overlap between the ith object and the jth object. When $u = 2 \cdot r_{\max}$, the ith object totally obscures the jth object.





FIG. 2 graphs the cost function as a function of u. It is clear that with the increasing value of u (u>0), the cost of overlap is also increasing. The value $\sigma_f$ in equation (3) can be determined by setting T=0.95 201 when $u = r_{\max}$ 202, that is

$$\sigma_f = \left. \frac{-u^{2}}{\ln(1 - T)} \right|_{u = r_{\max}}. \tag{4}$$
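Equations (2) through (4) can be checked numerically with a minimal sketch. Objects are treated as circles given as (x, y, r) tuples; the function names are hypothetical and chosen only for this illustration.

```python
import math


def sigma_from_threshold(u_ref, T=0.95):
    """Equation (4): the sigma for which the cost equals T when u == u_ref."""
    return -(u_ref ** 2) / math.log(1.0 - T)


def pair_overlap_cost(ci, cj, sigma_f):
    """Equation (3): overlap cost of two circles given as (x, y, r)."""
    (xi, yi, ri), (xj, yj, rj) = ci, cj
    u = ri + rj - math.hypot(xi - xj, yi - yj)
    if u <= 0:
        return 0.0  # the circles do not overlap
    return 1.0 - math.exp(-(u * u) / sigma_f)


def total_overlap_cost(circles, sigma_f):
    """Equation (2): sum of the pair costs over all unordered pairs i < j."""
    n = len(circles)
    return sum(pair_overlap_cost(circles[i], circles[j], sigma_f)
               for i in range(n) for j in range(i + 1, n))
```

With r_max = 1, two unit circles whose centers are one unit apart have u = r_max, so their pair cost is T = 0.95 by construction of σ_f.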













The cost function of total deviation is










$$G(p) = \sum_{i=1}^{N} g(p), \tag{5}$$

$$g(p) = 1 - e^{-\nu^{2}/\sigma_g}, \tag{6}$$













where $\nu = \sqrt{(x_i - x_i^0)^{2} + (y_i - y_i^0)^{2}}$ is the deviation of the ith object at the optimized location $(x_i, y_i)$ 131 from the original location $(x_i^0, y_i^0)$ 121 of the ith object.





FIG. 3 graphs the cost function g(p) 300 as a function of deviation. It is clear that with the increasing value of ν, the cost of deviation is also increasing. The value $\sigma_g$ in equation (6) is determined by setting T=0.95 301 when ν=maxsep. The value maxsep 302 can be set to $2 \cdot r_{\max}$. Thus, the value $\sigma_g$ is

$$\sigma_g = \left. \frac{-\nu^{2}}{\ln(1 - T)} \right|_{\nu = 2 \cdot r_{\max}}. \tag{7}$$













The value S in equation (1) is chosen to be (N−1)/2.
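The complete cost of equation (1) can be sketched as follows, combining equations (2) through (7) with S = (N−1)/2. Note that F sums N(N−1)/2 pair terms while G sums N terms, so this choice of S puts both on the same range. The function `layout_cost` and its argument names are hypothetical; the sketch only evaluates J and leaves the constrained minimization (display bounds and minimum radius) to whatever optimizer is used.

```python
import math


def layout_cost(locs, radii, orig_locs, r_max, lam=1.0, T=0.95):
    """Evaluate J = F + lam * S * G for a 2D layout of circular objects."""
    n = len(locs)
    sigma_f = -(r_max ** 2) / math.log(1.0 - T)           # equation (4)
    sigma_g = -((2.0 * r_max) ** 2) / math.log(1.0 - T)   # equation (7)

    # F: total overlap cost over all unordered pairs, equations (2)-(3).
    F = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.hypot(locs[i][0] - locs[j][0], locs[i][1] - locs[j][1])
            u = radii[i] + radii[j] - dist
            if u > 0:
                F += 1.0 - math.exp(-(u * u) / sigma_f)

    # G: total deviation cost from the original locations, equations (5)-(6).
    G = 0.0
    for (x, y), (x0, y0) in zip(locs, orig_locs):
        v = math.hypot(x - x0, y - y0)
        G += 1.0 - math.exp(-(v * v) / sigma_g)

    S = (n - 1) / 2.0
    return F + lam * S * G
```

For two unit circles that PCA placed on the same point, moving each one unit away in opposite directions removes the overlap entirely and lowers J from about 1.0 to about 0.53 when λ = 1.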




User Preference Clustering




In an optional step 160, we can also display the objects 101 according to a preferred arrangement selected by a user of our invention. Here, the user 153 initially places a small number of objects at preferred locations 161 on the display device 150. For example, the user places four objects at four separate arbitrary locations, e.g., objects with people, buildings, seascapes, and landscapes. More objects can be placed when the objects contain a mix of features, for example, people in front of buildings, or people on a lake or sea shore.




Given the preferred locations 161 of the user selected objects, we now place additional objects 101 on the display device 150 according to the user specified clustering.




We do this by determining a relative “feature” distance between the preferred clusters of objects. As stated above, a cluster can be as small as one object, although larger clusters will give a better result. The relative distance is expressed in terms of a weighting vector α 162. The features in the weighting vector α are identical to those in the feature vectors 111, although the weighting vector can have fewer features, for example, only visual features, or only visual and semantic features. We then apply 170 the weighting vector α 162 to each feature vector 111 before the PCA. This, surprisingly, will “skew” the optimized clustering towards the user's preferred arrangement.




Determining the Weighting Vector α




We describe the estimation of α for visual features only, e.g., color, texture, and structure, although it should be understood that any of the features 111 can be factored into α.




In this case, the weighting vector 162 is $\alpha = \{\alpha_c, \alpha_t, \alpha_s\}^T$, where $\alpha_c$ is the weight for color, $\alpha_t$ is the weight for texture, and $\alpha_s$ is the weight for structure. The number of objects in the preferred clustering is N. $X_c$ is an $L_c \times N$ matrix whose ith column is the color feature vector of the ith object, $i = 1, \ldots, N$; $X_t$ is the $L_t \times N$ matrix whose ith column is the texture feature vector of the ith object; and $X_s$ is the $L_s \times N$ matrix whose ith column is the structure feature vector of the ith object. The lengths of the color, texture and structure features are $L_c$, $L_t$, and $L_s$, respectively. The distance, for example Euclidean or Hamming, between the ith object and the jth object, for $i, j = 1, \ldots, N$, in the preferred clustering is $d_{ij}$.




We set the sum of $\alpha_c$, $\alpha_t$, $\alpha_s$ to 1, and define the best “fit” measure to be minimized as a constrained least squares optimization problem where p=2.










$$J = \sum_{i=1}^{N} \sum_{j=1}^{N} \left( d_{ij}^{p} - \sum_{k=1}^{L_c} \alpha_c^{p} \left| X_{c(i)}^{(k)} - X_{c(j)}^{(k)} \right|^{p} - \sum_{k=1}^{L_t} \alpha_t^{p} \left| X_{t(i)}^{(k)} - X_{t(j)}^{(k)} \right|^{p} - \sum_{k=1}^{L_s} \alpha_s^{p} \left| X_{s(i)}^{(k)} - X_{s(j)}^{(k)} \right|^{p} \right)^{2} \tag{8}$$

Let

$$V_{(ij)}^{c} = \sum_{k=1}^{L_c} \left| X_{c(i)}^{(k)} - X_{c(j)}^{(k)} \right|^{p} \tag{9}$$

$$V_{(ij)}^{t} = \sum_{k=1}^{L_t} \left| X_{t(i)}^{(k)} - X_{t(j)}^{(k)} \right|^{p} \tag{10}$$

$$V_{(ij)}^{s} = \sum_{k=1}^{L_s} \left| X_{s(i)}^{(k)} - X_{s(j)}^{(k)} \right|^{p} \tag{11}$$













and simplify to:









$$J = \sum_{i=1}^{N} \sum_{j=1}^{N} \left( d_{ij}^{p} - \alpha_c^{p} V_{(ij)}^{c} - \alpha_t^{p} V_{(ij)}^{t} - \alpha_s^{p} V_{(ij)}^{s} \right)^{2} \tag{12}$$













To minimize J, we take the partial derivatives of J relative to $\alpha_c^{p}$, $\alpha_t^{p}$, $\alpha_s^{p}$ and set the partials to zero, respectively.




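We set

$$\frac{\partial J}{\partial \alpha_c^{p}} = 0, \qquad \frac{\partial J}{\partial \alpha_t^{p}} = 0, \qquad \frac{\partial J}{\partial \alpha_s^{p}} = 0. \tag{13}$$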













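We have:

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \left( V_{(ij)}^{c} \right)^{2} \cdot \alpha_c^{p} + \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{t} \cdot \alpha_t^{p} + \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{s} \cdot \alpha_s^{p} = \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{2} V_{(ij)}^{c} \tag{14}$$

$$\sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{t} \cdot \alpha_c^{p} + \sum_{i=1}^{N} \sum_{j=1}^{N} \left( V_{(ij)}^{t} \right)^{2} \cdot \alpha_t^{p} + \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{t} V_{(ij)}^{s} \cdot \alpha_s^{p} = \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{2} V_{(ij)}^{t} \tag{15}$$

$$\sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{s} \cdot \alpha_c^{p} + \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{t} V_{(ij)}^{s} \cdot \alpha_t^{p} + \sum_{i=1}^{N} \sum_{j=1}^{N} \left( V_{(ij)}^{s} \right)^{2} \cdot \alpha_s^{p} = \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{2} V_{(ij)}^{s} \tag{16}$$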













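We let

$$A = \begin{bmatrix}
\sum_{i=1}^{N} \sum_{j=1}^{N} \left( V_{(ij)}^{c} \right)^{2} & \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{t} & \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{s} \\
\sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{t} & \sum_{i=1}^{N} \sum_{j=1}^{N} \left( V_{(ij)}^{t} \right)^{2} & \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{t} V_{(ij)}^{s} \\
\sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{c} V_{(ij)}^{s} & \sum_{i=1}^{N} \sum_{j=1}^{N} V_{(ij)}^{t} V_{(ij)}^{s} & \sum_{i=1}^{N} \sum_{j=1}^{N} \left( V_{(ij)}^{s} \right)^{2}
\end{bmatrix},$$

$$\beta = \begin{bmatrix} \alpha_c^{p} \\ \alpha_t^{p} \\ \alpha_s^{p} \end{bmatrix}, \quad \text{and} \quad
b = \begin{bmatrix}
\sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{2} V_{(ij)}^{c} \\
\sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{2} V_{(ij)}^{t} \\
\sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{2} V_{(ij)}^{s}
\end{bmatrix}.$$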













Equations (14)-(16) then simplify to

$$A \cdot \beta = b, \tag{17}$$

where β is calculated as a constrained least squares problem. The weighting vector α 162 is determined by the p-th root of β.
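The estimation of α from equations (9) through (17) can be sketched as follows. One simplification is an assumption of this sketch, not the text's method: instead of a true constrained least squares solver, it solves A·β = b directly, clips negative entries of β to zero, takes the p-th root, and rescales α to sum to 1. The helper names `solve_linear` and `estimate_weights` are hypothetical, and A is assumed to be non-singular.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def estimate_weights(groups, d, p=2):
    """Estimate one weight per feature group from target distances d[i][j].

    groups: list of feature groups (e.g. color, texture, structure); each
    group holds one feature list per object, in the same object order as d.
    """
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    # V[g] lists V_(ij) for group g over all pairs, as in equations (9)-(11).
    V = [[sum(abs(a - b) ** p for a, b in zip(X[i], X[j])) for i, j in pairs]
         for X in groups]
    target = [d[i][j] ** p for i, j in pairs]
    m = len(V)
    A = [[sum(x * y for x, y in zip(V[g], V[h])) for h in range(m)]
         for g in range(m)]
    b = [sum(t * x for t, x in zip(target, V[g])) for g in range(m)]
    beta = solve_linear(A, b)
    # Assumed projection step: clip, take the p-th root, normalize to sum 1.
    alpha = [max(value, 0.0) ** (1.0 / p) for value in beta]
    total = sum(alpha)
    return [a / total for a in alpha] if total > 0 else alpha
```

If the user-placed distances were generated exactly by weights (0.5, 0.3, 0.2), the function recovers those weights up to rounding.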




Location Transformation




In the preferred user specified arrangement, it is desired to have the additional objects located relative to the objects placed by the user. However, there is no express relationship between the user arrangement and the arrangement according to the PCA. Therefore, we apply a linear affine transform 170 to the locations 121 produced by the PCA 120. The affine transformation takes care of differences in rotation, reflection, stretching, translation and shearing between the PCA produced locations 121 and the preferred clustering 161.
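One way to obtain such an affine transform is a least-squares fit that maps the PCA locations of the user-placed objects onto their user-selected locations. The text does not say how the transform is estimated, so the fitting procedure below is an assumption, and the names `fit_affine` and `apply_affine` are hypothetical. At least three non-collinear source points are required.

```python
def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(src, dst):
    """Least-squares 2D affine map from src points to dst points.

    Returns [[a, b, c], [d, e, f]] with x' = a*x + b*y + c, y' = d*x + e*y + f.
    """
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations (R^T R) w = R^T t, solved once per output coordinate.
    rtr = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
    coeffs = []
    for axis in range(2):
        rtt = [sum(r[a] * q[axis] for r, q in zip(rows, dst)) for a in range(3)]
        coeffs.append(solve3(rtr, rtt))
    return coeffs


def apply_affine(coeffs, point):
    """Apply the fitted map to one (x, y) location."""
    x, y = point
    return tuple(c[0] * x + c[1] * y + c[2] for c in coeffs)
```

After fitting on the user-placed objects, `apply_affine` would be applied to the PCA location 121 of every additional object.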




However, due to the collinearity and ratio of distance preserving properties, the affine transformation does not align the additional objects with the user placed objects. Therefore, we also apply a non-linear transformation, e.g., a radial basis function, to the affine transformed locations to align the objects with the preferred locations.




Enhanced Visualization




As shown in FIG. 1, the system 100 according to our invention can also selectively enhance 180 a subset of the displayed objects based on relevance criteria 181. The enhancement 180 increases the visual prominence of this subset of objects according to a relevance score 182 for each displayed object 101. The enhancement can be done by increasing the size, contrast, or brightness of the objects as a function of their relevance score 182.




The relevance 132 represents a “third” dimension of visualization, the location being the other two dimensions. The purpose of the relevance 132 is to increase the information flow, and enhance the user's perception of the displayed objects based on the current context.




In a typical scenario, a user issues an object-based query, e.g., “find me similar objects,” and the system retrieves the nearest-neighboring objects, as described above, and renders the top N objects, using the alpha-weighted feature vectors 111. In this case, each object has an implicit equal “relevance,” or similarity to the query. It is desired to selectively enhance the visualization, so that displayed objects that have a greater contextual relevance become more visually prominent by, for example, proportionally increasing their sizes, in which case the relevance score becomes a constraint in the optimization 130. Alternatively, the contrast or brightness of the objects can be modified, perhaps independently of the optimization 130, or iconic information can display the actual relevance, e.g., rank number, or symbolic graphics.
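A minimal sketch of relevance-driven enhancement follows. It assumes the relevance score is normalized to [0, 1] and mapped linearly onto size or brightness; both the normalization and the linear mapping are assumptions of this illustration, and the function names are hypothetical.

```python
def relevance_to_radius(score, r_min, r_max):
    """Map a relevance score in [0, 1] linearly onto [r_min, r_max]."""
    score = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
    return r_min + score * (r_max - r_min)


def relevance_to_brightness(score, floor=0.4):
    """Map a relevance score in [0, 1] onto a brightness factor in [floor, 1]."""
    score = min(max(score, 0.0), 1.0)
    return floor + (1.0 - floor) * score
```

A radius produced this way could serve as the per-object size bound during the optimization 130, while the brightness factor can be applied at render time without touching the layout.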




The invention is described in terms that enable any person skilled in the art to make and use the invention, and is provided in the context of particular example applications and their requirements. Various modifications to the preferred embodiments will be readily apparent to those skilled in the art, and the principles described herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Thus, the present invention is not intended to be limited to the embodiments described herein, but is to be accorded with the broadest scope of the claims below, consistent with the principles and features disclosed herein.



Claims
  • 1. A method for visualizing a plurality of stored multimedia objects on a display device: placing a set of user selected multimedia objects at user selected locations on the display device; assigning a feature vector to each user selected multimedia object and stored multimedia object; reducing the feature vector to a location vector having a dimensionality of the display device; evaluating a cost function to determine an optimal location vector for each multimedia object, and an optimal size for each multimedia object, wherein the cost function maximizes a total visibility of the plurality of stored multimedia objects and minimizes a total deviation between the location vectors and the optimal location vectors, and wherein the cost function J is a linear combination of a first cost function F(p) and a second cost function G(p), J=F(p)+λ·S·G(p), where the first cost function maximizes the total visibility, and the second cost function minimizes the total deviation, S is a scaling factor, and λ is a weight; and displaying each stored multimedia object on a display device according to the optimal location vector and the optimal size and a similarity to the user selected multimedia objects placed on the display device.
  • 2. The method of claim 1 wherein the feature vector includes visual features.
  • 3. The method of claim 1 wherein the feature vector includes semantic features.
  • 4. The method of claim 1 wherein the feature vector includes audio features.
  • 5. The method of claim 1 wherein the feature vector includes motion features.
  • 6. The method of claim 1 wherein the feature vector includes color, texture, and structure features.
  • 7. The method of claim 1 wherein the feature vectors include cepstrum, temporal trajectories, pitch, and zero crossing rate features.
  • 8. The method of claim 1 wherein the location vector uses Cartesian coordinates.
  • 9. The method of claim 1 wherein the location vector uses polar coordinates.
  • 10. The method of claim 9 further comprising: reducing a size of particular stored multimedia objects to maximize the total visibility.
  • 11. The method of claim 10 wherein the reduced size is greater than a threshold minimum size.
  • 12. The method of claim 1 further comprising: determining a weighting vector from the set of user selected multimedia objects and the user selected location on the display device; and weighting each feature vector by the weighting vector before reducing the feature vector.
  • 13. The method of claim 13 further comprising: applying an affine transform to the locations before displaying each stored multimedia object on the display device according to the optimal location vector and the user selected locations on the display device.
  • 14. The method of claim 13 further comprising: applying a non-linear transformation to the affine transformed locations.
  • 15. The method of claim 1 further comprising: determining a relevance score for each displayed multimedia object; and enhancing each displayed multimedia object according to the relevance score.
  • 16. The method of claim 15 wherein a particular displayed multimedia object is enhanced by increasing a brightness of the particular displayed multimedia object.
  • 17. The method of claim 15 wherein a particular displayed multimedia object is enhanced by increasing a contrast of the particular displayed multimedia object.
  • 18. The method of claim 1 wherein principal component analysis is applied to each feature vector to reduce the dimensionality of the feature vector.
  • 19. The method of claim 1 further comprising: providing a clustering of the user selected multimedia objects placed by the user on the display device; extracting a weighting vector from the clustering of the user selected multimedia objects; and displaying each stored multimedia object on the display device according to the optimal location vector and the optimal size, and the weighting vector to match the clustering of the user selected multimedia objects placed by the user on the display device.
US Referenced Citations (11)
Number Name Date Kind
5839103 Mammone et al. Nov 1998 A
5915250 Jain et al. Jun 1999 A
5918223 Blum et al. Jun 1999 A
5983251 Martens et al. Nov 1999 A
5987456 Ravela et al. Nov 1999 A
6173275 Caid et al. Jan 2001 B1
6240423 Hirata May 2001 B1
6243492 Kamei Jun 2001 B1
6400846 Lin et al. Jun 2002 B1
6597818 Kumar et al. Jul 2003 B2
6608923 Zhang et al. Aug 2003 B1
Non-Patent Literature Citations (5)
Entry
Hiroike et al., “Visualization for Similarity-Based Image Retrieval Systems”; pp. 171-176, 1999.
Hiroike et al., “Visualization of Information Spaces to Retrieve and Browse Image Data”; Third International Conference on Visual Information Systems, Springer-Verlag, pp. 155-162, 1999.
Musha et al., “An Interface for Visualizing Feature Space in Image Retrieval”; MVA'98, IAPR Workshop on Machine Vision Applications, 1998, pp. 447-450.
Yossi Rubner, “Perceptual Metrics for Image Database Navigation”; Dissertation submitted to Department of Computer Science of Stanford University, May, 1999.