The invention relates to a method and a device for generating a saliency map for a picture of a sequence of pictures.
Salient visual features that attract human attention can be important and powerful cues for video analysis and processing, including content-based coding, compression, transmission/rate control, indexing, browsing, display and presentation. State-of-the-art methods for detecting and extracting visually salient features mainly handle still pictures. The few methods that handle sequences of pictures first compute spatial and temporal saliency values independently and then combine them in a rather arbitrary manner in order to generate a spatio-temporal saliency value. The spatial saliency values are generally based on the computation, in some heuristic way, of the contrasts of various visual features (intensity, color, texture, etc.). These methods often assume that the temporal saliency value relates to motion. Therefore, they first estimate motion fields using state-of-the-art motion estimation methods and then compute the temporal saliency values as heuristically chosen functions of the estimated motion fields.
These methods have many drawbacks. First, accurate estimation of motion fields is known to be a difficult task. Second, even with accurate motion fields, the relationship between these motion fields and temporal saliency values is not straightforward. Therefore, it is difficult to compute accurate temporal saliency values based on estimated motion fields. Third, even assuming spatial and temporal saliency values can be correctly computed, the combination of these values is not straightforward. State-of-the-art methods often weight the temporal and spatial saliency values in an arbitrary manner to obtain a global spatio-temporal saliency value, which is often not accurate.
The object of the invention is to remedy at least one of the drawbacks of the prior art. The invention relates to a method for generating a saliency map for a picture of a sequence of pictures, the picture being divided into blocks of pixels. The method comprises a step for computing a saliency value for each block of the picture. According to the invention, the saliency value equals the self information of the block, the self information depending on the spatial and temporal contexts of the block.
Preferentially, the self information is computed based on the probability of observing the block given its spatial and temporal contexts, the probability being the product of the probability of observing the block given its spatial context and of the probability of observing the block given its temporal context.
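By way of a worked illustration, with numbers that are not from the disclosure: using base-2 logarithms, a block observed with probability 0.1 given its spatial context and 0.2 given its temporal context yields p = 0.1 * 0.2 = 0.02, hence a self information, and thus a saliency value, of −log2(0.02) ≈ 5.64 bits.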
According to one preferred embodiment, the probability of observing the block given its spatial context is estimated as follows:
associate with each block of the picture a set of K ordered coefficients, with K a positive integer, the set of coefficients being generated by transforming the block by a first predefined transform;
estimate, for each coefficient of order k, its probability distribution within the picture, k ∈ [1; K]; and
compute the probability of observing the block given its spatial context as the product of the probabilities of the coefficients of the set associated with the block.
Preferentially, the first predefined transform is a two-dimensional discrete cosine transform.
Advantageously, the probability of observing the block given its temporal context is estimated based on the probability of observing a first volume comprising the blocks co-located with the block in the N pictures preceding the picture where the block is located, called current picture, and on the probability of observing a second volume comprising the first volume and the block, with N a positive integer. Preferentially, the probability of observing the first volume is estimated as follows:
associate a set of P ordered coefficients with each volume comprising the blocks co-located with one of the blocks of the current picture in the N pictures preceding the current picture, with P a positive integer, the set of coefficients being generated by transforming the volume by a second predefined transform;
estimate, for each coefficient of order p, its probability distribution, p ∈ [1; P]; and
compute the probability of observing the first volume as the product of the probabilities of the coefficients of the set associated with the first volume.
Preferentially, the probability of observing the second volume is estimated as follows:
associate a set of Q ordered coefficients with each volume comprising one of the blocks of the current picture and the blocks co-located with that block in the N pictures preceding the current picture, with Q a positive integer, the set of coefficients being generated by transforming the volume by the second predefined transform;
estimate, for each coefficient of order q, its probability distribution, q ∈ [1; Q]; and
compute the probability of observing the second volume as the product of the probabilities of the coefficients of the set associated with the second volume.
Advantageously, the second predefined transform is a three-dimensional discrete cosine transform.
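Summarized in formula form (the shorthand below is not from the claims: Cs and Ct denote the spatial and temporal contexts of a block B, V1 the first volume, V2 the second volume comprising V1 and B, and ck(B) the k-th coefficient produced by the first transform; the ratio form of p(B|Ct) is one reading of the claims, made explicit in (eq1) further below):

```latex
p(B \mid C_s, C_t)
  = \underbrace{\prod_{k=1}^{K} p_k\!\bigl(c_k(B)\bigr)}_{p(B \mid C_s)}
    \cdot
    \underbrace{\frac{p(V_2)}{p(V_1)}}_{p(B \mid C_t)},
\qquad
S(B) = -\log p(B \mid C_s, C_t).
```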
The invention also relates to a device for generating a saliency map for a picture of a sequence of pictures, the picture being divided into blocks of pixels, comprising means for computing a saliency value for each block of the picture, characterized in that the saliency value equals the self information of the block, the self information depending on the spatial and temporal contexts of the block.
The invention also concerns a computer program product comprising program code instructions for the execution of the steps of the method for generating a saliency map as described above, when the program is executed on a computer.
Other features and advantages of the invention will become apparent from the following description of some of its embodiments, this description being made in connection with the drawings, in which:
The method according to the invention consists in generating a spatio-temporal saliency map as depicted on
The uniqueness of a spatio-temporal event B(x, y, t) is affected by its spatial and temporal contexts. If an event is unique in its spatial context, it is likely to be salient. Similarly, if it is unique in its temporal context, it is also likely to be salient. Both the spatial context and the temporal context therefore influence the uniqueness of a spatio-temporal event. Accordingly, in a first embodiment, a spatio-temporal saliency value Sst(B(x0, y0, t)) is computed for a given block of pixels B(x0, y0, t) in a picture F(t) as the amount of self information Ist(B(x0, y0, t)) contained in the event B(x0, y0, t) given its spatial and temporal contexts. The self information Ist(B(x0, y0, t)) represents the amount of information gained when one learns that B(x0, y0, t) has occurred. According to Shannon's information theory, the amount of self information Ist(B(x0, y0, t)) is defined as a positive and decreasing function f of the probability of occurrence, i.e. Ist(B(x0, y0, t)) = f(p(B(x0, y0, t)|V(x0, y0, t−1), F(t))), with f(1) = 0, f(0) = +∞, and f(p(x)*p(y)) = f(p(x)) + f(p(y)) if x and y are two independent events. f is defined as follows: f(x) = log(1/x). According to Shannon's information theory, the self information I(x) of an event x thus decreases as the likelihood of observing x increases.
The spatio-temporal saliency value Sst(B(x0, y0, t)) associated with the block B(x0, y0, t) is therefore defined as follows: Sst(B(x0, y0, t)) = Ist(B(x0, y0, t)) = −log(p(B(x0, y0, t)|V(x0, y0, t−1), F(t))). The spatial context of the event B(x0, y0, t) is the picture F(t). The temporal context of the event B(x0, y0, t) is the volume V(x0, y0, t−1), i.e. the set of blocks co-located with the block B(x0, y0, t) and located in the N pictures preceding the picture F(t). A block in a picture F(t′) is co-located with the block B(x0, y0, t) if it is located in F(t′) at the same position (x0, y0) as the block B(x0, y0, t) in the picture F(t).
In order to simplify the computation of the saliency values, the spatial and temporal contexts are assumed to be independent. Therefore, the joint conditional probability p(B(x0, y0, t)|V(x0, y0, t−1), F(t)) may be rewritten as follows:
p(B(x0, y0, t)|V(x0, y0, t−1), F(t)) = p(B(x0, y0, t)|V(x0, y0, t−1)) * p(B(x0, y0, t)|F(t))
Therefore, according to a preferred embodiment depicted on
Sst(B(x0, y0, t)) = −log(p(B(x0, y0, t)|V(x0, y0, t−1))) − log(p(B(x0, y0, t)|F(t)))
The temporal conditional probability p(B(x0, y0, t)|V(x0, y0, t−1)) is estimated 10 from the probabilities of the volumes V(x0, y0, t) and V(x0, y0, t−1). Indeed,
p(B(x0, y0, t)|V(x0, y0, t−1)) = p(V(x0, y0, t)) / p(V(x0, y0, t−1)) (eq1).
For the purpose of estimating the probabilities p(V(x0, y0, t)) and p(V(x0, y0, t−1)), the high dimensional data set V(x,y,t) is projected into an uncorrelated vector space. For example, if N=2 and the blocks comprise m×n=4×4 pixels, then V(x,y,t) ∈ R32, i.e. V(x,y,t) belongs to a 32-dimensional vector space. Let ϕk, k = 1, 2, ..., K, be an orthogonal basis of this vector space. If V(x,y,t) ∈ R32, then K=32. The spatio-temporal probability p(V(x0, y0, t)) is thus estimated as follows:
Step 1: for each position (x,y), compute the coefficients ck(x,y,t) of V(x,y,t) in the vector space basis as follows: ck (x, y, t)=ϕkV(x,y,t) ∀x,y;
Step 2: estimate the probability distribution pk(c) of ck(x,y,t); and
Step 3: compute the probability p(V(x0, y0, t)) as follows:
p(V(x0, y0, t)) = Πk pk(ϕk V(x0, y0, t)).
The same method is used to estimate the probability p(V(x0, y0, t−1)).
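By way of illustration only, steps 1 to 3 may be sketched in Python as follows. The function name volume_probabilities, the numpy/scipy dependency, the 3D DCT as the orthogonal basis (consistent with the preferred embodiment described below), the 32-bin histograms and the epsilon guard are illustrative assumptions, not part of the disclosure:

```python
# Sketch only: estimate p(V(x, y, t)) for every block position by
# projecting each volume onto an orthogonal basis (here a 3-D DCT) and
# multiplying per-coefficient histogram probabilities (steps 1 to 3).
import numpy as np
from scipy.fft import dctn

def volume_probabilities(frames, n=4, bins=32, eps=1e-9):
    """frames: array (N, H, W) holding the N pictures forming the volumes;
    returns an (H//n, W//n) array of probabilities p(V(x, y, t))."""
    frames = np.asarray(frames, dtype=float)
    N, H, W = frames.shape
    bh, bw = H // n, W // n
    # Step 1: the K = N*n*n ordered 3-D DCT coefficients c_k(x, y, t).
    coeffs = np.empty((bh, bw, N * n * n))
    for by in range(bh):
        for bx in range(bw):
            volume = frames[:, by*n:(by+1)*n, bx*n:(bx+1)*n]
            coeffs[by, bx] = dctn(volume, norm='ortho').ravel()
    # Steps 2 and 3: one histogram p_k per coefficient order k, then the
    # product over k of the per-coefficient probabilities.
    prob = np.ones((bh, bw))
    for k in range(coeffs.shape[-1]):
        hist, edges = np.histogram(coeffs[..., k], bins=bins)
        pk = (hist + eps) / (hist.sum() + eps * bins)  # avoid zero bins
        prob *= pk[np.digitize(coeffs[..., k], edges[1:-1])]
    return prob
```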
The temporal saliency value St(B(x0, y0, t)) is then computed 20 from p(V(x0, y0, t)) and p(V(x0, y0, t−1)) according to (eq1), i.e. St(B(x0, y0, t)) = −log(p(V(x0, y0, t)) / p(V(x0, y0, t−1))). A temporal saliency map is depicted on
The method described above for estimating the probability p(V(x0, y0, t)) is used to estimate 30 the probability p(B(x0, y0, t)). The spatial conditional probability p(B(x0, y0, t)|F(t)) is equivalent to p(B(x0, y0, t)) since only the current picture F(t) influences the uniqueness of a spatio-temporal event B(x0, y0, t). Therefore, to estimate p(B(x0, y0, t)|F(t)) it is only required to estimate the probability of the spatio-temporal event B(x0, y0, t) against all the events in the picture F(t) as follows:
Step 1: for each position (x,y), compute the coefficients dk(x,y,t) of B(x,y,t) in the vector space basis as follows: dk(x,y,t)=ϕkB(x,y,t) ∀x,y;
Step 2: estimate the probability distribution pk(d) of dk(x,y,t); and
Step 3: compute the probability p(B(x0, y0, t)) as follows:
p(B(x0, y0, t)) = Πk pk(ϕk B(x0, y0, t))
Preferentially, a 2D DCT (discrete cosine transform) is used to compute the probability p(B(x0, y0, t)). Each 4×4 block B(x,y,t) in a current picture F(t) is transformed (step 1) into a 16-D vector (d1(x,y,t), d2(x,y,t), ..., d16(x,y,t)). The probability distribution pk(d) is estimated (step 2) within the picture by computing a histogram in each dimension k. Finally, the probability p(B(x0, y0, t)) is derived (step 3) from these estimated distributions as the product of the probabilities pk(ϕk B(x0, y0, t)) of the coefficients dk(x0, y0, t). The same method is applied to compute the probabilities p(V(x0, y0, t)) and p(V(x0, y0, t−1)); however, in this case a 3D DCT is applied instead of a 2D DCT.

The method enables real-time processing at a rate of more than 30 pictures per second for CIF format pictures. Besides, since the model is based on information theory, it is more meaningful than state-of-the-art methods based on statistics and heuristics. For example, if the spatio-temporal saliency value of one block is 1 and the spatio-temporal saliency value of another block is 2, then the second block carries about twice as much information as the first one in the same situation. Such a conclusion cannot be drawn from spatio-temporal saliency maps derived with state-of-the-art methods.
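A companion sketch for this spatial side (again illustrative Python under the same assumptions as the earlier volume sketch; block_probabilities is a hypothetical name) follows the same three steps with a 2D DCT on the 4×4 blocks:

```python
# Sketch only: estimate p(B(x, y, t)) for every 4x4 block of one picture
# via its 16 2-D DCT coefficients and per-coefficient histograms.
import numpy as np
from scipy.fft import dctn

def block_probabilities(frame, n=4, bins=32, eps=1e-9):
    """frame: 2-D array (H, W); returns an (H//n, W//n) array of p(B)."""
    frame = np.asarray(frame, dtype=float)
    H, W = frame.shape
    bh, bw = H // n, W // n
    # Step 1: the K = n*n ordered 2-D DCT coefficients d_k(x, y, t).
    coeffs = np.empty((bh, bw, n * n))
    for by in range(bh):
        for bx in range(bw):
            block = frame[by*n:(by+1)*n, bx*n:(bx+1)*n]
            coeffs[by, bx] = dctn(block, norm='ortho').ravel()
    # Steps 2 and 3: histogram p_k(d) per dimension k, product over k.
    prob = np.ones((bh, bw))
    for k in range(n * n):
        hist, edges = np.histogram(coeffs[..., k], bins=bins)
        pk = (hist + eps) / (hist.sum() + eps * bins)  # avoid zero bins
        prob *= pk[np.digitize(coeffs[..., k], edges[1:-1])]
    return prob
```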
The spatial saliency value Ss(B(x0, y0, t)) is then computed 40 from the probability p(B(x0, y0, t)) as follows: Ss(B(x0, y0, t)) = −log(p(B(x0, y0, t))). A spatial saliency map is depicted on
The global spatio-temporal saliency value Sst(B(x0, y0, t)) is finally computed 50 as the sum of the temporal and spatial saliency values.
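Putting the pieces together, the hypothetical sketches above combine as follows. The sliding-window reading, under which V(x0, y0, t) spans the N most recent pictures as in the 32-dimensional example, is an assumption of this sketch:

```python
# End-to-end sketch (our assembly, reusing the hypothetical
# block_probabilities and volume_probabilities defined above):
# global saliency = temporal term + spatial term, per (eq1) and step 50.
import numpy as np

def saliency_map(frames, n=4, N=2):
    """frames: (T, H, W) array, oldest picture first, with T >= N + 1;
    returns the spatio-temporal saliency map for the last picture."""
    frames = np.asarray(frames, dtype=float)
    p_v_t   = volume_probabilities(frames[-N:], n)       # p(V(x, y, t))
    p_v_tm1 = volume_probabilities(frames[-N-1:-1], n)   # p(V(x, y, t-1))
    p_b     = block_probabilities(frames[-1], n)         # p(B(x, y, t))
    # S_st = -log(p(V(t)) / p(V(t-1))) - log(p(B))  (temporal + spatial)
    return -np.log(p_v_t / p_v_tm1) - np.log(p_b)
```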
The saliency maps generated for the pictures of a sequence of pictures can advantageously support video processing and analysis, including content-based coding, compression, transmission/rate control, picture indexing, browsing, display and video quality estimation.