The present disclosure relates to video conferencing and more particularly to determining a horizontal gaze of a person involved in a video conferencing session.
Face detection in video conferencing systems has many applications. For example, perceptual quality of decoded video under a given bit-rate budget can be improved by giving preference to face regions in the video coding process. However, face detection techniques alone do not provide any indication as to the horizontal gaze of a person. The horizontal gaze of a person can be used to determine “who is looking at whom” during a video conferencing session.
Gaze estimation techniques heretofore known were generally developed to aid human-computer interaction. As a result, they commonly rely on accurate eye tracking, either using special and extensive hardware to track optical phenomena of the eyes or using computer vision techniques to map the eyes to an abstracted model. The performance of eye mapping techniques is generally poor due to the difficulty of accurately locating and tracking the eyeballs and the computational complexity those processes require.
Accordingly, techniques are desired for estimating in real-time the horizontal gaze of a person or persons involved in a video conference session.
Techniques are described herein to determine the horizontal gaze of a person from a video signal generated from viewing the person with at least one video camera. From the video signal, a head region of the person is detected and tracked. The dimension and location of a sub-region within the head region is also detected and tracked from the video signal. An estimate of the horizontal gaze of the person is computed from a relative position of the sub-region within the head region.
Referring first to
Endpoint 100(1) comprises a video camera cluster shown at 110(1) and a display 120(1) comprised of multiple display panels (segments or sections) configured to display the image of a corresponding person. Endpoint 100(2) comprises a similarly configured video camera cluster 110(2) and a display 120(2). Each video camera cluster 110(1) and 110(2) may comprise one or more video cameras. Video camera cluster 110(1) is configured to capture into one video signal or several individual video signals each of the participating persons A-E in group 20 at endpoint 100(1), and video camera cluster 110(2) is configured to capture into one video signal or several individual video signals each of the participating persons G-L in group 30 at endpoint 100(2). For example, there may be a separate video camera (in each video camera cluster) directed to a corresponding person position around a table. Not shown for reasons of simplicity in
As indicated above, the display 120(1) comprises multiple display sections or panels configured to display in separate display sections a video image of a corresponding person, and more particularly, a video image of a corresponding person in group 30 at endpoint 100(2). Thus, display 120(1) comprises individual display sections to display corresponding video images of persons G-L (shown in phantom), derived from the video signal output generated by video camera cluster 110(2) at endpoint 100(2). Similarly, display 120(2) comprises individual display sections to display corresponding video images of persons A-E (shown in phantom), derived from the video signal output generated by video camera cluster 110(1) at endpoint 100(1).
Moreover,
The head rectangle and the ENM rectangle each have a horizontal center point. In
A measurement distance d is defined as the distance between the horizontal centers of the head rectangle and the ENM rectangle within it. Another measurement r is defined as a “radius” (½ the horizontal side length) of the head rectangle. Contrasting
Referring again to
α=arcsin(d/r) (1)
where d and r are defined as explained above.
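By way of illustration, the following is a minimal sketch, in Python, of how equation (1) might be evaluated from a tracked head rectangle and ENM rectangle; the (left, top, width, height) tuple representation of the rectangles is an assumption made for the example.

```python
import math

def horizontal_gaze_angle(head_rect, enm_rect):
    """Estimate the horizontal gaze angle alpha of equation (1), in radians.

    head_rect, enm_rect: (left, top, width, height) tuples in pixels, with
    both rectangles expressed in the same image coordinate system.
    """
    head_left, _, head_width, _ = head_rect
    enm_left, _, enm_width, _ = enm_rect

    # Horizontal center points of the head rectangle and the ENM rectangle.
    head_center_x = head_left + head_width / 2.0
    enm_center_x = enm_left + enm_width / 2.0

    d = enm_center_x - head_center_x   # signed horizontal offset
    r = head_width / 2.0               # "radius" of the head rectangle

    # Clamp the ratio so tracking noise cannot push it outside arcsin's domain.
    ratio = max(-1.0, min(1.0, d / r))
    return math.asin(ratio)            # alpha = arcsin(d / r)
```

The sign of the returned angle indicates to which side of the camera axis the person is looking, subject to the image coordinate convention in use.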
The actual viewing angle in
Reference is now made to
Turning now to
Each endpoint 100(1) and 100(2) can simultaneously serve as both a source and a destination of a video stream (containing video and audio information). Endpoint 100(1) comprises the video camera cluster 110(1), the display 120(1), an encoder 130(1), a decoder 140(1), a network interface and control unit 150(1) and a controller 160(1). Similarly, endpoint 100(2) comprises the video camera cluster 110(2), the display 120(2), an encoder 130(2), a decoder 140(2), a network interface and control unit 150(2) and a controller 160(2). Since the endpoints are the same, the operation of only endpoint 100(1) is now briefly described.
The video camera cluster 110(1) captures video of one or more persons and supplies video signals to the encoder 130(1). The encoder 130(1) encodes the video signals into packets for further processing by the network interface and control unit 150(1) that transmits the packets to the other endpoint device via the network 170. The network 170 may consist of a local area network and a wide area network, e.g., the Internet. The network interface and control unit 150(1) also receives packets sent from endpoint 100(2) and supplies them to the decoder 140(1). The decoder 140(1) decodes the packets into a format for display of picture information on the display 120(1). Audio is also captured by one or more microphones (not shown) and encoded into the stream of packets passed between endpoint devices. The controller 160(1) is configured to perform horizontal gaze analysis of the video signals produced by the video camera cluster 110(1) and from the decoded video signals that are derived from video captured by video camera cluster 110(2) and received from the endpoint 100(2). Likewise, the controller 160(2) at endpoint 100(2) is configured to perform horizontal gaze analysis of the video signals produced by the video camera cluster 110(2) and from the decoded video signals that are derived from video captured by video camera cluster 110(1) and received from the endpoint 100(1).
While
Turning now to
Turning to
At 220, the ENM sub-region within the head region is detected and its dimensions and location within the head region are tracked. The output of the function 220 is data for dimensions and relative location of an ENM sub-region (rectangle) within the head region (rectangle). Again, examples of the ENM sub-region (e.g., ENM rectangle) are shown at reference numerals 52 and 62 in
Using data representing the head region and the dimensions and relative location of the ENM sub-region within the head region, an estimate of the horizontal gaze, e.g., gaze angle α, is computed at 230. The computation for the horizontal gaze angle is given and described above with respect to equation (1) for the horizontal gaze of a person with respect to a video camera using the angles as defined in
At 250, a determination is then made as to the person at whom the person whose head region and ENM sub-region are being tracked at functions 210 and 220 is looking. In making the determination at 250, other data and system parameter information is used, including the face positions on the various display sections (at the local endpoint device and the remote endpoint device(s)), as well as the display displacement distance from a video camera cluster to the face of a person (determined or approximated a priori).
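One possible way to make the determination at 250 is sketched below, under the assumption that the angular position of each display section relative to the camera viewing the person is known; the section labels and angles used are purely illustrative.

```python
import math

def looked_at_section(gaze_angle, section_angles):
    """Return the label of the display section whose known angular position,
    relative to the camera viewing the person, is closest to the gaze angle."""
    return min(section_angles, key=lambda label: abs(section_angles[label] - gaze_angle))

# Illustrative layout: three display sections at -20, 0 and +20 degrees.
sections = {"G": math.radians(-20), "H": 0.0, "I": math.radians(20)}
print(looked_at_section(math.radians(15), sections))   # -> "I"
```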
Referring now to
As shown in
p(x_n | x_{n−1}) ~ N(x_n | x_{n−1}, Λ)  (2)
where x_{n−1}, the state at the previous time step, is the mean and Λ = diag(σ_x², σ_y², σ_w², σ_h²) is the covariance matrix for the multi-dimensional Gaussian distribution.
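A brief sketch of how new particles might be proposed from the state evolution model of equation (2) is shown below; the particular standard deviation values are hand-tuned assumptions for the example.

```python
import numpy as np

def propagate_particles(particles, sigmas=(4.0, 4.0, 2.0, 2.0), rng=None):
    """Draw x_n^i ~ N(x_{n-1}^i, Lambda) for each particle, per equation (2).

    particles: (N_s, 4) array of previous states [x, y, w, h].
    sigmas: (sigma_x, sigma_y, sigma_w, sigma_h), hand-tuned standard deviations.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigmas, size=particles.shape)  # diagonal covariance
    return particles + noise   # Gaussian random walk around the previous state
```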
For each sample {x_n^i}_{i=1}^{N_s} computed at 232, functions 234 and 236 are performed. Function 234 involves computing at least one image analysis feature of the ENM sub-region and comparing it with respect to a corresponding reference model. At function 236, importance weights are computed for a proposed (new) particle distribution based on the at least one image analysis feature computed at 234.
More specifically, at 234, one or more measurement models, also called likelihoods, are employed to relate the noisy measurements to the state (the ENM rectangle). For example, two sources of measurements (image features) are considered: color features, y_C, and edge features, y_E. More explicitly, the normalized color histograms in the blue chrominance (Cb) and red chrominance (Cr) color domains and the vertical and horizontal projections of edge features are analyzed. To do so, a reference histogram or projection is generated, either offline using manually selected training data or online by running a relatively coarse ENM detection scheme, such as those described in the aforementioned published patent applications, for a number of frames and computing a time average.
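The following sketch illustrates one way such reference features might be formed, assuming the Cb and Cr planes of each ENM patch are available as 8-bit arrays; the bin counts and the resampling of the edge projections to a fixed length are illustrative choices rather than prescribed values.

```python
import numpy as np

def color_histogram(cb_patch, cr_patch, bins=32):
    """Normalized Cb/Cr color histogram of an ENM patch, concatenated into one vector."""
    h_cb, _ = np.histogram(cb_patch, bins=bins, range=(0, 256))
    h_cr, _ = np.histogram(cr_patch, bins=bins, range=(0, 256))
    h = np.concatenate([h_cb, h_cr]).astype(float)
    return h / (h.sum() + 1e-12)

def edge_projections(edge_patch, bins=16):
    """Normalized vertical and horizontal projections of an edge-magnitude patch,
    resampled to a fixed length so patches of different sizes are comparable."""
    v = edge_patch.sum(axis=0).astype(float)   # projection along columns
    h = edge_patch.sum(axis=1).astype(float)   # projection along rows
    v = np.interp(np.linspace(0, len(v) - 1, bins), np.arange(len(v)), v)
    h = np.interp(np.linspace(0, len(h) - 1, bins), np.arange(len(h)), h)
    p = np.concatenate([v, h])
    return p / (p.sum() + 1e-12)

def reference_model(feature_vectors):
    """Time-average of per-frame feature vectors, used as the reference h_ref."""
    return np.mean(np.stack(feature_vectors), axis=0)
```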
Denoting the reference histogram or projection as h_ref and the histogram or projection for the region corresponding to the state x as h_x, the likelihood model is defined as
for color histograms, and
for edge feature projections, where D(h_1, h_0) is the Bhattacharyya similarity distance, defined as

D(h_1, h_0) = [1 − Σ_{b=1}^{B} √(h_1(b)·h_0(b))]^{1/2}  (5)
with B denoting the number of bins of the histogram or the projection.
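A minimal sketch of the Bhattacharyya distance and of one common way to turn it into a likelihood is shown below; the exponential form exp(−λ·D²) and the scale parameter lam are assumptions for illustration and need not match the exact forms of equations (3) and (4).

```python
import numpy as np

def bhattacharyya_distance(h1, h0):
    """D(h1, h0) = sqrt(1 - sum_b sqrt(h1(b) * h0(b))) for normalized histograms."""
    bc = np.sum(np.sqrt(h1 * h0))          # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))     # clamp tiny negative rounding errors

def feature_likelihood(h_x, h_ref, lam=20.0):
    """Assumed likelihood model p(y | x) = exp(-lam * D^2); lam is hand-tuned."""
    d = bhattacharyya_distance(h_x, h_ref)
    return float(np.exp(-lam * d * d))
```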
At 236, the proposed distribution of new samples is computed. While the choice of the proposal distribution is important for the performance of the particle filter, one technique is to choose the proposal distribution as the state evolution model p(x_n | x_{n−1}). In this case, the particles {x_n^i}_{i=1}^{N_s} are drawn from the state evolution model and the importance weights are updated as
ω_n^i ∝ ω_{n−1}^i · p(y_C | x_n^i) · p(y_E | x_n^i).  (6)
At 240, the weights are normalized such that they sum to one, i.e., Σ_{i=1}^{N_s} ω_n^i = 1.
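A short sketch of the weight update of equation (6) followed by the normalization at 240, assuming the per-particle color and edge likelihoods have already been evaluated:

```python
import numpy as np

def update_weights(prev_weights, color_lik, edge_lik):
    """Equation (6) with normalization: w_n^i ∝ w_{n-1}^i * p(y_C|x_n^i) * p(y_E|x_n^i)."""
    w = np.asarray(prev_weights) * np.asarray(color_lik) * np.asarray(edge_lik)
    total = w.sum()
    if total <= 0.0:                       # all likelihoods vanished; fall back to uniform
        return np.full(len(w), 1.0 / len(w))
    return w / total
```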
At 242, a re-sampling function is performed at each time step to compute a new (re-sampled) distribution by multiplying particles with high importance weights and discarding or de-emphasizing particles with low importance weights, while preserving the same number of samples. Without re-sampling, a degeneracy phenomenon may occur, in which most of the weight becomes concentrated on a single particle, dramatically degrading the sample-based approximation of the filtering distribution.
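One standard re-sampling scheme, systematic re-sampling, is sketched below; the choice of this particular scheme is an assumption, as any re-sampling method that preserves the number of particles could be used.

```python
import numpy as np

def systematic_resample(particles, weights, rng=None):
    """Return the same number of particles, drawn in proportion to their weights."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n   # one stratified position per slot
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                            # guard against rounding error
    indices = np.searchsorted(cumulative, positions)
    return particles[indices], np.full(n, 1.0 / n)  # duplicated survivors, uniform weights
```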
At 244, an updated state representing the dimensions and location of the ENM sub-region within the head region, f({x_n^i, ω_n^i}_{i=1}^{N_s}), is computed, for example, as the weighted mean of all of the particles or the weighted average of the first few particles that have the highest importance weights. The updated state may be computed at 244 after determining that the state is stable. For example, the state may be said to be stable when it is determined that the weighted mean square error of the particles, var_n, as denoted in equation (7) below, is less than a predetermined threshold value for at least one video frame. There are other ways to determine that the state is stable, and in some applications, it may be desirable to compute an update to the state even if it is not stable.
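A brief sketch of the state update at 244 as a weighted mean of the particles, together with a stability test based on their weighted mean square error; the threshold value is an illustrative tuning parameter.

```python
import numpy as np

def state_estimate(particles, weights):
    """Updated ENM state at 244 as the weighted mean of all particles."""
    return np.average(particles, axis=0, weights=weights)

def is_stable(particles, weights, threshold=25.0):
    """True when the weighted mean square error of the particles is below a threshold."""
    mean = state_estimate(particles, weights)
    var_n = np.sum(weights * np.sum((particles - mean) ** 2, axis=1))
    return var_n < threshold
```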
The particle filtering method to determine the dimensions and location of the ENM sub-region within the head region can be summarized as follows.
With {x_{n−1}^i, ω_{n−1}^i}_{i=1}^{N_s} from the previous time step:
FOR i = 1:N_s
  Draw a new sample x_n^i from the state evolution model p(x_n | x_{n−1}^i) per equation (2).
  Compute the importance weight ω_n^i per equation (6).
END FOR
Normalize the weights {ω_n^i}_{i=1}^{N_s} so that they sum to one.
Re-sample.
The horizontal gaze analysis techniques described herein provide gaze awareness of multiple conference participants in a video conferencing session. These techniques are useful in developing value added features that are based on a better understanding of an ongoing telepresence video conferencing session. The techniques can be executed in real-time and do not require special hardware or accurate eyeball location determination of a person.
There are many uses for the horizontal gaze analysis techniques described herein. One use is to find a “common view” of a group of participants. For example, if a first person is speaking, but several other persons are seen to change their gaze to look at a second person's reaction (even though the second person may not be speaking at that time), the video signal from the video camera cluster can be selected (i.e., cut) to show the second person. Thus, a common view can be determined while displaying video images of each of a plurality of persons on corresponding ones of a plurality of video display sections, by determining towards which of the plurality of persons a given person is looking from the estimate of the horizontal gaze of the given person. Another related application is to display the speaking person's video image on one screen (or on one-half of a display section by cropping the picture) and the person at whom the speaking person is looking on an adjacent screen (or the other half of the same display section). In these scenarios, the gaze or common view information is used as input to the video switching algorithm.
The way to handle the situation of people looking in different directions depends on the application. In the video switching examples, the conflict could be resolved by giving preference to the "common view," to the active speaker, or to a "more important" person identified by some other pre-defined means based on the context of the meeting.
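As a rough sketch of how the common-view and speaker-preference ideas might drive video switching, assuming each participant's gaze has already been resolved to a target person label:

```python
from collections import Counter

def common_view_target(gaze_targets, active_speaker=None):
    """Pick the person most participants are looking at; fall back to the speaker.

    gaze_targets: one resolved target label per tracked participant (None if unknown).
    """
    counts = Counter(t for t in gaze_targets if t is not None)
    if not counts:
        return active_speaker
    target, votes = counts.most_common(1)[0]
    # Prefer the active speaker when there is no clear majority of agreeing gazes.
    if active_speaker is not None and votes <= len(gaze_targets) // 2:
        return active_speaker
    return target
```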
Still another application is to fix eye gaze by moving eyeballs. The horizontal gaze analysis techniques described herein can be used to determine that a person's gaze is not "correct" because the person is looking at a display screen or section but is being captured by a video camera that is not above that display screen or section. Under these circumstances, processing of the video signal for that person can be artificially compensated to "move" or adjust that person's eyeball direction so that it appears as if he/she were looking in the correct direction.
Yet another application is to fix eye gaze by switching video cameras. Instead of artificially moving the eyeballs of a person, a determination is made from the horizontal gaze of the person as to which display screen or section he/she is looking at, and a video signal from one of a plurality of video cameras is selected, e.g., the video camera co-located with that display screen or section for viewing that person.
Still another use is for massive reference memory indexing. Massive reference memory may be exploited to improve prediction-based video compression by providing a well-matching prediction reference. Applying the horizontal gaze analysis techniques described herein can facilitate the process of finding the matching reference. In searching through massive memory, for example, frames that have a similar eye gaze (and head position) may provide good matches and can be considered as candidate prediction references to improve video compression. Further searching can then be focused on such candidate frames to find the best matching prediction reference, thereby accelerating the process.
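A rough sketch of how stored frames might be indexed and pre-filtered by gaze angle and head position before the full reference search; the feature set and the weighting between angle and position are assumptions made for illustration.

```python
import numpy as np

def candidate_reference_frames(frame_features, query, k=5, angle_weight=100.0):
    """Pre-select k stored frames with a similar gaze angle and head position.

    frame_features: (F, 3) array of [gaze_angle, head_x, head_y] per stored frame.
    query: the same three features for the frame currently being encoded.
    """
    diff = np.asarray(frame_features, dtype=float) - np.asarray(query, dtype=float)
    diff[:, 0] *= angle_weight              # make radians comparable to pixel offsets
    dist = np.linalg.norm(diff, axis=1)
    return np.argsort(dist)[:k]             # indices of the closest candidate frames
```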
Although the apparatus, system, and method are illustrated and described herein as embodied in one or more specific examples, it is nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made therein without departing from the scope of the apparatus, system, and method and within the scope and range of equivalents of the claims. Accordingly, it is appropriate that the appended claims be construed broadly and in a manner consistent with the scope of the apparatus, system, and method, as set forth in the following claims.