The invention concerns a method for computing a fingerprint of a video sequence.
The invention concerns mainly the field of content identification and authentication.
New multimedia database search tools require rapidity, editing, resistance and copyright enforcement/protection. Fingerprinting is largely used to search image/video content in large multimedia databases. A fingerprinting design extracts discriminating features, called fingerprints, typical for each image/video and thus specific to each image/video. The fingerprint can also be called image DNA or video DNA.
The main application is image/video copy detection and/or monitoring. Security and output bit length are the fingerprinting weaknesses.
The visual hash is a fingerprinting technique with crypto-system constraints. A visual hash function computes a unique, constant bit length, condensed version of the content, also called a visual digest. A small change of the image/video leads to a small change of the visual digest. Visual hash is used in multimedia content authentication.
None of the known techniques combine efficient key frame detection (shot boundaries, stable frames) and efficient key frame description.
One aim of the invention is to propose a video fingerprinting process which combines visual hash and local fingerprinting.
To this end, the invention concerns a method for computing a fingerprint of a video sequence. According to the invention, the method comprises the steps of
The proposed fingerprint can be resistant against temporal cropping, spatial cropping, change of frame rate, geometrical manipulations (scaling, rotation), luminance changes, etc. The video fingerprint is easy to compute (time demanding context) and collision resistant.
According to a preferred embodiment, the computation of a visual digest for each frame is based on the computation of a visual hash function.
The invention proposes an innovative video content identification process which combines a visual hash function and a local fingerprinting. Thanks to a visual hash function, one can observe the video content variation and detect key frames. A local image fingerprint technique characterizes the detected key frames. The set of local fingerprints computed on key frames for the whole video summarizes the video or fragments of the video.
According to a preferred embodiment, the said visual hash function comprises the sub-steps of
the global visual digest of said frame being the set of all the visual digests of the different angular orientations.
Preferentially, the visual digest for one orientation is based on the luminance of the pixels in the strip of said orientation.
Advantageously, the step of detecting the shots comprises the sub-steps of:
Preferentially, the shots are defined as being the frames whose distance to their neighbor frame is lower than the maximum of the said two thresholds.
According to a preferred embodiment, the step of detection of a stable frame comprises the sub-steps of:
Preferentially, the step of calculation of a local fingerprint for said stable frame comprises the sub-steps of
In an advantageous manner, the computation of a descriptor comprises the following sub-steps
Preferentially, the local fingerprint of a stable frame is the set of the descriptors of all the interest points of said stable frame.
According to a preferred embodiment, the video fingerprint is the set of the local fingerprints of all the stable frames.
Other characteristics and advantages of the invention will appear through the description of a non-limiting embodiment of the invention, which will be illustrated, with the help of the enclosed drawings.
Embodiments of the present invention may be implemented in software, firmware, hardware or by any combination of various techniques. For example, in some embodiments, the present invention may be provided as a computer program product or software which may include a machine or computer-readable medium having stored thereon instructions which may be used to program a computer (or other electronic devices) to perform a process according to the present invention. In other embodiments, steps of the present invention might be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.
Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (for instance a computer). These mechanisms include, but are not limited to, floppy diskettes, optical disks, hard disk drives, holographic disks, compact disk read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memory (RAM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, a transmission over the Internet, electrical, optical, acoustical or other forms of propagated signals (for instance carrier waves, infrared signals, digital signals, etc.), or the like.
Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or the like, may refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
In the following detailed description of the embodiments, reference is made to the accompanying drawings that show, by way of illustration, specific embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present invention. Moreover, it is to be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular feature, structure, or characteristic described in one embodiment may be included within other embodiments.
This invention can be used for many purposes.
A first application is to retrieve a video or a fragment of video in a database using video fingerprint as a database index.
A second application is to identify, in real time (or close to it), a video stream by matching the fingerprints of the current video with all fingerprints of a video database, even if strong distortions are applied to the original content. It is a copy detection application.
A third application is to find a model of distortion applied to an original movie. By matching the local fingerprints of an original copy and a pirate copy, a model of distortion is computed. It is a co-registration application. This co-registration technique can be used in an informed watermarking algorithm.
Of course this is a non-limiting list of applications of the invention and other applications can be envisaged.
On
Each video comprises a set of frames for which a fingerprint is to be calculated. During a step E1, a visual hash function is applied on each of the video frames.
A hash function, used in signature generation, computes a unique condensed version of data, a bit stream summary, called message digest. What is called visual hash function is a perceptual hash function designed for images and video contents. Visual hash function uses fingerprinting techniques with crypto-system constraints. A visual hash function verifies:
A one-way function or cryptographic hash function f has the property “ease of computation”: for every input x (from domain of f) f(x) is ‘easy’ to compute.
A hash function f maps an input x of arbitrary bit length to an output f(x) of fixed bit length.
Given any image y, for which there exists an x with f(x)=y, it is computationally infeasible to compute any pre-image x′ with f(x′)=y.
Given any pre-image x it is computationally infeasible to find a 2nd pre-image x′≠x with f(x)=f(x′). Two pre-images x, x′ are different if and only if their contents are different.
The image f(x) must be resistant and robust, i.e. shall remain the same before and after attacks, if these attacks do not alter the perceptive content. f(x)≈f(x′) if x≈x′. x≈x′ means that x′ is a version of x (same visual content).
For image/video application, the image f(x) is called visual digest. A small change of a content leads to a small change of the visual digest. A high change of the content leads to a high change of the visual digest.
An image visual digest can be:
A video visual digest can be:
When the visual digest of each of the frames of the video is obtained, a set of key frames is obtained, the key frames being the shot boundaries and the stable frames. The stable frames are frames with the smallest content variation along a shot. For such a frame, the content distance between this frame and its neighbor frames is very low.
Then, a local fingerprint is calculated on each of the stable frames of the video. This local fingerprint represents the fingerprint of the shot the frame belongs to. A set of video fingerprints is thus calculated as the set of the fingerprints of the detected shots of the video.
Step E1: Visual digest construction by the hash function:
The visual digest is based on pseudo variance of the luminance of the points selected on a polar frame representation. On
The set of points on a line passing through the image center, with the orientation θ, are selected. On
Only the p-axis is used to characterize the pixels of each angular orientation θ. For an angular orientation θ, a point (x,y) is characterized by the couple (p,θ).
For a strip width of 1, a point (x,y) is a selected point if its coordinate p satisfies:
−0.5≦p−p′≦0.5 (1)
With (p′,θ) the coordinates of the middle point (x′,y′) for a same given θ.
So the equation (1) becomes:
−0.5≦(x−x′)cos θ+(y−y′)sin θ≦0.5 (2)
For a strip width of η, we extend (2) to:
−η/2≦(x−x′)cos θ+(y−y′)sin θ≦η/2 (3)
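The selection rule of inequality (3) can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `strip_points`, the pixel-grid convention and the choice of the frame center as ((width−1)/2, (height−1)/2) are assumptions made for the example.

```python
import math

def strip_points(width, height, theta_deg, eta=1.0):
    """Return the pixels (x, y) that satisfy inequality (3) for orientation theta.

    Assumption: the middle point (x', y') is the geometric center of the
    pixel grid, i.e. ((width - 1) / 2, (height - 1) / 2).
    """
    theta = math.radians(theta_deg)
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    selected = []
    for y in range(height):
        for x in range(width):
            # Offset of (x, y) from the middle point along the p-axis.
            p_offset = (x - cx) * math.cos(theta) + (y - cy) * math.sin(theta)
            if -eta / 2.0 <= p_offset <= eta / 2.0:
                selected.append((x, y))
    return selected

# On a 5x5 grid with theta = 0 and a strip width of 1, only the pixels
# whose x coordinate equals the center column are kept.
print(strip_points(5, 5, 0))
```

Increasing `eta` widens the strip symmetrically around the middle point, as in the passage from inequality (2) to inequality (3).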
The preferred embodiment takes care of the importance of a point in the image. The importance of a point (x,y) in an image can be weighted by the relative position of this point (x,y) to the center (x′,y′). The distance r of a point (x,y) from the ellipse (
Only pixels belonging to the ellipses are selected among the pixels previously selected. The characteristics of the ellipse are its width and its height, which are the width and height of the frame. The equation of the ellipse being
If a discretization of 1° angle is selected, the visual digest is composed of 180 elements. Each element of the visual digest is thus computed by:
I(p,θ) is the value of the selected point (p,θ) (for example the luminance of the pixel (p,θ))
The image visual digest of a frame i is:
VD(i)={Elt(θ)}θ=0 . . . 179
With card(VD(i))=180
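A digest of 180 elements, one per 1° orientation, could be assembled along the following lines. This is a hedged sketch only: the element formula itself is not reproduced here, so the plain population variance of the luminance is used as a stand-in for the pseudo variance, the ellipse is assumed to be the one inscribed in the frame, and the distance-based weighting is omitted. The name `visual_digest` is hypothetical.

```python
import math

def visual_digest(frame, eta=1.0, n_angles=180):
    """frame: list of rows of luminance values. Returns one element per angle.

    Assumptions (not taken from the text): the element is the population
    variance of the selected luminance values, and the ellipse is inscribed
    in the frame with semi-axes width/2 and height/2.
    """
    height, width = len(frame), len(frame[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    a, b = width / 2.0, height / 2.0
    digest = []
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        values = []
        for y in range(height):
            for x in range(width):
                dx, dy = x - cx, y - cy
                in_strip = abs(dx * cos_t + dy * sin_t) <= eta / 2.0
                in_ellipse = (dx / a) ** 2 + (dy / b) ** 2 <= 1.0
                if in_strip and in_ellipse:
                    values.append(frame[y][x])
        if not values:  # defensive guard for degenerate frame sizes
            digest.append(0.0)
            continue
        mean = sum(values) / len(values)
        digest.append(sum((v - mean) ** 2 for v in values) / len(values))
    return digest

# A uniform frame has no luminance variation along any orientation.
flat = [[128] * 9 for _ in range(7)]
print(len(visual_digest(flat)))
```

With the default discretization the returned list has 180 entries, matching card(VD(i))=180.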
In a step E2, the stable frames are detected.
The evolution of the visual digest distance over a group of frames allows the detection of key frames. A shot boundary is an abrupt variation of the visual digest inside a group of frames. The set of frames between two shot boundaries is called a shot. The stable frame presents the smallest distance variation of visual digest inside a shot. The goal of this step is to extract stable frames. When a stable frame is detected, it can be characterized by using local fingerprinting.
First of all, during this step, a shot boundary detection is performed.
An automatic threshold process determines abrupt transitions along the video and detects shot boundaries. The global automatic threshold process is based on two thresholds:
Two frames with the same content have the same (or nearly the same) visual hash. Instead of the histogram, the visual hash function described previously is chosen to study the video content variation. Such an approach has the additional advantage of being very fast and not CPU/memory demanding; in addition, the proposed visual digest is more sensitive to small changes and thus more accurate. Thus, the visual digest distance measures the similarity between frames. Inside a window S, the shot boundary corresponds to the frame where a peak of the visual digest variation is detected. This frame is located at the center of the window S. The shot boundary on a window S, denoted SB, is calculated as below:
SB=i|dist(VD(i),VD(i+1))>max(Tglobal(i, L1),Tlocal(i, L2))
And the set of shot boundaries along a video is denoted ShotBound
ShotBound={SB}
The cardinality of ShotBound depends on video activity.
The visual hash variation is very sensitive to noise and therefore one shot can comprise several non significant peaks. It is therefore necessary to filter these peaks in order to keep only significant maxima.
A first window S1 (the biggest compared to a second one S2) is used to analyze the activity distribution around a central point. S2 is much smaller and avoids having a second peak close to the first peak.
A pseudo-global threshold, denoted Tglobal(i, L1), is defined on a large sliding window S1 of size 2 L1+1, centered on frame i. Let μ(i) and σ(i) denote the mean and the variance of dist(VD(k),VD(k+1)) measured for all k in S1=[i−L1; i+L1]. We define the proposed pseudo-global threshold as:
Tglobal(i, L1)=μ(i)+α1·σ(i)
A local and adaptive threshold, denoted Tlocal(i, L2), is computed on a small sliding window S2 of size 2 L2+1, with L2<<L1, centered on frame i. Specifically, we have:
Tlocal(i, L2)=α2·dmax(i)
In the preferred embodiment, L1=20, L2=12, α1=3 and α2=2.
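The two-threshold test can be illustrated by the sketch below, which takes the sequence of distances dist(VD(i),VD(i+1)) as input. Several points are assumptions made for the example and are not stated in the text: dmax(i) is taken as the largest distance in S2 excluding frame i itself, the windows are clipped at the ends of the sequence, and the function name `detect_shot_boundaries` is hypothetical.

```python
from statistics import mean, pvariance

def detect_shot_boundaries(d, L1=20, L2=12, alpha1=3.0, alpha2=2.0):
    """d[i] is dist(VD(i), VD(i+1)). Returns the indices flagged as boundaries."""
    n = len(d)
    boundaries = []
    for i in range(n):
        # Pseudo-global threshold on the large window S1 (clipped at the edges).
        s1 = d[max(0, i - L1):min(n, i + L1 + 1)]
        # The text calls sigma(i) the variance; swap in statistics.pstdev
        # if the standard deviation is intended.
        t_global = mean(s1) + alpha1 * pvariance(s1)
        # Local threshold on the small window S2.
        # Assumption: dmax(i) is the largest distance in S2 other than d[i].
        s2 = [d[k] for k in range(max(0, i - L2), min(n, i + L2 + 1)) if k != i]
        d_max = max(s2) if s2 else 0.0
        t_local = alpha2 * d_max
        if d[i] > max(t_global, t_local):
            boundaries.append(i)
    return boundaries

# A flat distance profile with one isolated peak at index 25.
distances = [1.0] * 50
distances[25] = 5.0
print(detect_shot_boundaries(distances))
```

Under these assumptions the local threshold rejects a second peak of comparable height inside S2, which is the filtering role the text attributes to the small window.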
Once the shot boundary detection is done, the stable frames are detected.
A stable frame is the frame with the smallest content variation along a shot. For such a frame, the content distance between this frame and the neighbor frames is very low. Two frames with the same content have the same (or nearly the same) visual hash (global fingerprint in our case).
Inside a shot, for each group of 2 L3+1 frames, an average of the content image distance (at the position j) is given by:
The stable frame is the frame which provides the smallest average of content image distance within a shot. This stable frame must have well distributed content information.
We obtain one stable frame per shot.
The preferred value for L3 is 5.
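The selection of one stable frame per shot can be sketched as follows. The averaging formula is not reproduced in the text above, so this example assumes a plain mean of the consecutive-frame distances over the window [j−L3; j+L3], clipped to the shot; the function name `stable_frame_index` is hypothetical.

```python
def stable_frame_index(shot_distances, L3=5):
    """shot_distances[k] is the visual digest distance at position k of a shot.

    Returns the position j whose windowed average distance is the smallest.
    Assumption: the average is a plain mean over [j - L3, j + L3], clipped
    to the shot limits.
    """
    n = len(shot_distances)
    if n == 0:
        raise ValueError("empty shot")
    best_j, best_avg = 0, float("inf")
    for j in range(n):
        window = shot_distances[max(0, j - L3):min(n, j + L3 + 1)]
        avg = sum(window) / len(window)
        if avg < best_avg:  # strict comparison keeps the first minimum
            best_j, best_avg = j, avg
    return best_j

# A shot that is active at both ends and calm in the middle.
shot = [5.0] * 8 + [1.0] * 11 + [5.0] * 8
print(stable_frame_index(shot))
```

Only the frame centered in the calm segment has a full window of low distances, so it is the one returned.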
In a step E3, a local fingerprint is computed for each of the detected stable frames.
The local fingerprint process is divided into two steps:
Interest Points Detection
The interest points detection is based on a Difference of Gaussians. It consists in detecting repeatable key points. A key point is repeatable if its pixel location is stable. It must be resistant to scale change, rotation, filtering, etc. A cascade of filtered images is used. The Gaussian kernel is a scale-space kernel candidate. The theoretical interest of such an approach is that the difference of two Gaussians with respective variances k·σ and σ is a very good approximation of the normalized Gaussian Laplacian:
G(x,y,kσ)−G(x,y,σ)≈(k−1)σ²∇²G (5)
The convolution of (5) with the image leads to the Difference of Gaussians function (DOG image):
D(x,y,σ)=(G(x,y,kσ)−G(x,y,σ))*I(x,y) (6)
D(x,y,σ) represents the DOG function;
I(x,y) represents the luminance of pixels of coordinates (x,y).
Therefore, for an input image, cascades of filtered images are built that are called “octaves”, and then the difference of Gaussians for each octave is computed. The extrema of the DOG images represent potential locations of interest points. Not all of these locations contain relevant information. Thus, a further thresholding is applied to these points. Only the points with good contrast and precise space localization are kept. The locations of the detected key points in the scale space are then stored.
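Approximation (5) can be checked numerically with the short sketch below. It assumes the usual normalized two-dimensional Gaussian G(x,y,σ)=exp(−(x²+y²)/(2σ²))/(2πσ²), for which the Laplacian has the closed form G·(x²+y²−2σ²)/σ⁴; the function names and the sample values σ=1.6 and k=1.01 are chosen for illustration only.

```python
import math

def gaussian(x, y, sigma):
    """Normalized 2-D Gaussian kernel (assumed form of G)."""
    return math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

def laplacian_of_gaussian(x, y, sigma):
    """Closed-form Laplacian of the Gaussian above."""
    r2 = x * x + y * y
    return gaussian(x, y, sigma) * (r2 - 2 * sigma ** 2) / sigma ** 4

def dog(x, y, sigma, k):
    """Left-hand side of (5): difference of two Gaussians."""
    return gaussian(x, y, k * sigma) - gaussian(x, y, sigma)

def scaled_log(x, y, sigma, k):
    """Right-hand side of (5): (k - 1) * sigma^2 * Laplacian of G."""
    return (k - 1) * sigma ** 2 * laplacian_of_gaussian(x, y, sigma)

for point in [(0.0, 0.0), (1.0, 0.5)]:
    lhs = dog(*point, sigma=1.6, k=1.01)
    rhs = scaled_log(*point, sigma=1.6, k=1.01)
    print(point, lhs, rhs)
```

The two sides agree more closely as k approaches 1, since the difference of Gaussians is a finite-difference estimate of the derivative of G with respect to σ.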
Local Description of the Interest Points
In the previous step, interest points detection is performed and their localization in the scale-space is stored. In this step, each interest point is characterized by computing a local descriptor. The descriptor must be both discriminant and invariant to a certain number of transformations. A discriminant descriptor is a descriptor which provides representative and different values for each different content. The description of an interest point must be independent from the description of other interest points. An efficient descriptor allows a correct matching in a large database of descriptors with high probability.
To compute a descriptor, a circular neighborhood of radius β is considered in order to be invariant to rotation. The interest point orientation, KO(x,y), is computed by summing the gradient vectors in a small disc Disc(x,y) around the interest point.
The orientation of the resulting vector gives the interest point orientation (
The gradient of a pixel (x′,y′) is given by its magnitude and its orientation:
Where Lx and Ly are the derivatives of the function L(x,y) with respect to x and y dimensions respectively. And where L(x,y) is the Gaussian image:
L(x,y,σ)=G(x,y,σ)*I(x,y)
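The orientation computation can be sketched as follows, starting from an already smoothed image L. The derivatives Lx and Ly are estimated here with central differences and the disc radius defaults to 3 pixels; both choices, as well as the function name `keypoint_orientation`, are assumptions made for the example.

```python
import math

def keypoint_orientation(L, x, y, radius=3):
    """Sum the gradient vectors (Lx, Ly) over a disc centered on (x, y) and
    return the orientation of the resulting vector, in radians.

    L is a list of rows of the Gaussian-smoothed image. Border pixels are
    skipped because central differences need both neighbors.
    """
    height, width = len(L), len(L[0])
    sum_x = sum_y = 0.0
    for v in range(max(1, y - radius), min(height - 1, y + radius + 1)):
        for u in range(max(1, x - radius), min(width - 1, x + radius + 1)):
            if (u - x) ** 2 + (v - y) ** 2 > radius ** 2:
                continue  # outside the disc
            lx = (L[v][u + 1] - L[v][u - 1]) / 2.0
            ly = (L[v + 1][u] - L[v - 1][u]) / 2.0
            sum_x += lx
            sum_y += ly
    return math.atan2(sum_y, sum_x)

# A horizontal luminance ramp has all its gradients pointing along +x.
ramp = [[float(u) for u in range(9)] for _ in range(9)]
print(keypoint_orientation(ramp, 4, 4))
```

Summing the vectors rather than averaging the angles lets strong gradients dominate the result, which is consistent with a description by magnitude and orientation.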
According to the interest point orientation KO(x,y) defined previously, a neighboring disc of radius R (preferably equal to 20) centered on (x,y) is then divided into nine regions (
With:
For each of the nine regions, we compute a sixteen-bin local histogram of the gradient orientations.
Histo(i,φ,k)=#{Pixel(x″,y″) | orientation(x″,y″)=φ, (x″,y″)∈R(i,k)}
The final descriptor of an interest point k (KD) is the concatenation of the nine histograms.
KD(k)=[Histo(i,φ,k)]i=1 . . . 9
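The concatenation of nine sixteen-bin histograms yields a descriptor of 144 values, as the sketch below illustrates. The partition of the disc into regions is not reproduced here, so the example takes the gradient orientations of each region as already grouped; the function names and the uniform binning over [0, 2π) are assumptions.

```python
import math

def orientation_histogram(orientations, n_bins=16):
    """Count gradient orientations (radians) into n_bins equal bins over [0, 2*pi)."""
    hist = [0] * n_bins
    for phi in orientations:
        b = int((phi % (2 * math.pi)) / (2 * math.pi) * n_bins) % n_bins
        hist[b] += 1
    return hist

def keypoint_descriptor(region_orientations, n_bins=16):
    """Concatenate one histogram per region. Expects exactly nine regions."""
    if len(region_orientations) != 9:
        raise ValueError("expected the orientations of nine regions")
    descriptor = []
    for orientations in region_orientations:
        descriptor.extend(orientation_histogram(orientations, n_bins))
    return descriptor

# Hypothetical input: two pixels in the first region, one in each other region.
regions = [[0.0, 0.1]] + [[3.0] for _ in range(8)]
kd = keypoint_descriptor(regions)
print(len(kd))
```

The fixed length of the descriptor makes two interest points directly comparable, whatever the number of pixels in their regions.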
The Local Fingerprint of a stable frame i, called also shot fingerprint (SF), is the set of all KD:
SF(i)={KD(k)}
The cardinality of SF(i) depends on image activity.
The Video Fingerprint VF is the set of shot fingerprints SF:
Number | Date | Country | Kind |
---|---|---|---|
06290100.4 | Jan 2006 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2007/000334 | 1/16/2007 | WO | 00 | 7/16/2008 |