The present invention relates to a method and arrangement for identifying virtual visual information in at least two images from a sequence of successive images of a visual scene comprising real visual information and said virtual visual information.
When capturing a real-world scene using one or more cameras, it is desirable to capture only the scene objects that are actually present, and not those that are merely presented there virtually, e.g. by projection. An example may be a future video conferencing system for enabling a video conference between several people who are physically located in several distinct meeting rooms. In such a system a virtual environment in which all participants are placed may be represented by projection on a screen, or rendered onto one or more of the visualization devices available in the real meeting rooms. To capture the information needed to render this virtual environment, e.g. which persons are participating, their movements, their expressions, etc., cameras are used which are placed in the different meeting rooms. However, these cameras not only track the real people and objects in the rooms, but also the people and objects as virtually rendered, e.g. on the large screens within these same meeting rooms. While the real people of course need to be tracked to enable a better videoconferencing experience, their projections should not be tracked, or should at least be filtered out in a subsequent step.
Possible existing solutions to this problem make use of visualization devices at fixed positions cooperating with calibrated cameras, which allows simple rules to be derived for filtering out the unwanted visual information. This can be used for traditional screens having fixed positions within the meeting rooms.
A problem with this solution is that it only works for relatively static scenes whose composition is known in advance. This solution also requires manual calibration steps, which is a drawback in situations requiring easy deployability. Another drawback relates to the fact that, irrespective of the content, the area of the captured images corresponding to the screen area of the projected virtual content will always be filtered out. While this may be appropriate for older types of screen, it may no longer be appropriate for newer screen technologies such as translucent screens that only become opaque in certain areas when there is something to be displayed, e.g. a cut-out video of a person talking. In this case the area that is allocated as being 'virtual' for a certain camera is not virtual at all instances in time. Moving cameras are furthermore difficult to support with this solution.
An object of embodiments of the present invention is therefore to provide a method for identifying the virtual visual information within at least two images of a sequence of successive images of a visual scene comprising real visual information and said virtual visual information, but which does not present the inherent drawbacks of the prior art methods.
According to embodiments of the invention this object is achieved by the method comprising the steps of
performing feature detection on at least one of said at least two images,
determining the movement of the detected features between said at least two images, thereby obtaining a set of movements,
identifying which movements of said set pertain to movements in a substantially vertical plane, thereby identifying a set of vertical movements,
relating the features pertaining to said vertical movements to said virtual visual information in said at least two images, such as to identify the virtual visual information.
In this way, detection of movements of features in a vertical plane will be used to identify virtual content of the image parts associated with these features. These features can be recognized objects, such as human beings, a table, a wall, a screen, a chair, or parts thereof such as mouths, ears or eyes. These features can also be corners, lines, gradients, or more complex features such as those provided by algorithms such as the well-known scale-invariant feature transform (SIFT) algorithm.
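By way of a non-limiting, purely illustrative sketch (assuming the availability of the OpenCV library and its SIFT implementation; neither is prescribed by the invention), such feature detection on a single captured image could for instance be realized as follows:

```python
# Illustrative sketch only: detect candidate features (here SIFT keypoints)
# in one captured image; OpenCV and a BGR input image are assumed.
import cv2


def detect_features(image_bgr):
    """Return keypoints and descriptors for one image of the sequence."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```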
As the virtual screen information within the meeting rooms will generally contain images of the meeting participants, who usually show some movements, e.g. by speaking, writing or turning their heads, and as the position of the screen can be considered as substantially vertical, detection of movements lying in a vertical plane, hereafter denoted as vertical movements, can be a simple way of identifying the virtual visual content in the images. The real movements of the real, thus non-projected, people are generally three-dimensional movements which do not lie in a vertical plane. The thus identified virtual visual information can then be filtered out from the images in a subsequent image or video processing step.
In an embodiment of the method the vertical movements are identified as movements of said set of movements which are related by a homography to movements of a second set of movements pertaining to said features, said second set of movements being obtained from at least two other images from a second sequence of images, and pertaining to the same timing instances as said at least two images of said first sequence of images.
As determining homographies between two sets of movements is a rather straightforward and simple operation, these embodiments allow for an easy detection of movements in a vertical plane. These movements generally correspond to movements projected on vertical screens, and are thus representative of movements of the virtual visual information.
The first set of movements is determined from the first video sequence, while the second set of movements is either determined from a second sequence of images of the same scene, taken by a second camera, or, alternatively, from a predetermined sequence containing only the virtual information. This predetermined sequence may e.g. correspond to the sequence to be projected on the screen, and may be provided to the arrangement by means of a separate video or TV channel.
By comparing the movements of the first sequence with those of the second sequence, and identifying which ones are homographically related, it can be deduced that the movements having a homographical relationship with some movements of the second sequence are movements in a plane, as this is a characteristic of homographical relationships. If it is known from scene information that no other movements in a plane are present, e.g. all persons are moving only while remaining seated around the table, it may be concluded that the detected planar movements are those corresponding to the movements on the screen, thus corresponding to the movements lying in a vertical plane, as no other movements in a plane will be present.
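As a hedged illustration of how such a homographical relationship could be tested in practice (again assuming OpenCV; the robust RANSAC estimator and the reprojection threshold are implementation choices, not requirements of the invention), corresponding feature positions of the two sequences can be fed to a homography estimator, whose inliers then correspond to movements lying in a common plane:

```python
# Illustrative sketch: given corresponding feature positions observed in the
# first sequence (pts_seq1) and in the second sequence (pts_seq2), estimate a
# homography with RANSAC; inlier correspondences are consistent with a single
# plane and are thus candidates for the projected (screen) content.
import cv2
import numpy as np


def planar_inliers(pts_seq1, pts_seq2, reproj_threshold=3.0):
    """pts_seq1, pts_seq2: (N, 2) arrays of matched point positions."""
    H, inlier_mask = cv2.findHomography(
        np.asarray(pts_seq1, dtype=np.float32),
        np.asarray(pts_seq2, dtype=np.float32),
        cv2.RANSAC,
        reproj_threshold,
    )
    if H is None:
        return None, np.zeros(len(pts_seq1), dtype=bool)
    return H, inlier_mask.ravel().astype(bool)
```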
In case however people are also moving around the meeting room, movements may also be detected in the horizontal plane of the floor. For these situations an extra filtering step for removing the horizontal movements, or alternatively an extra selection step for selecting only the movements in a vertical plane from all movements detected in a plane, may be appropriate.
Once the vertical movements are found, the respective image parts pertaining to the corresponding features of these vertical movements may then be identified as the virtual visual information.
It is to be remarked that verticality is to be determined relative to a horizontal reference plane, which may e.g. correspond to the floor of the meeting room or to the horizontal reference plane of the first camera. The vertical angle, which is nominally 90 degrees with respect to this horizontal reference plane, is typically allowed a tolerance of 10 degrees above and below these 90 degrees.
The present invention relates as well to embodiments of an arrangement for performing the present method embodiments, to a computer program product incorporating code for performing the present method, and to an image analyzer incorporating such an arrangement.
It is to be noticed that the term ‘coupled’, used in the claims, should not be interpreted as being limitative to direct connections only. Thus, the scope of the expression ‘a device A coupled to a device B’ should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means.
It is to be noticed that the term ‘comprising’, used in the claims, should not be interpreted as being limitative to the means listed thereafter. Thus, the scope of the expression ‘a device comprising means A and B’ should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.
The above and other objects and features of the invention will become more apparent and the invention itself will be best understood by referring to the following description of an embodiment taken in conjunction with the accompanying drawings wherein
a-b show more detailed implementations of module 200 of
The description and drawings merely illustrate the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass equivalents thereof.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
Movement feature extraction takes place in step 200. These movement features can relate to movements of features, such as the motion vectors themselves, or can alternatively relate to the aggregated begin and end points of the motion vectors pertaining to a single feature, thus relating more to the features undergoing the movements themselves. Methods for determining these movements of features are explained with reference to
Once these movements of features are determined, it is to be checked in step 300 whether they pertain to vertical movements, in this document thus meaning movements in a vertical plane. A vertical plane is defined relative to a horizontal reference plane, within certain tolerances. This horizontal reference plane may e.g. correspond to the floor of the meeting room, or to the horizontal reference plane of the camera or source providing the first sequence of images. Typical values for the angle between a vertical plane and this horizontal reference plane are 80 to 100 degrees. How this determination of vertical movements is done will be explained with reference to e.g.
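As a hedged numerical illustration of this tolerance (the choice of the 'up' axis depends on the chosen coordinate frame and is an assumption here), a plane can be classified as vertical when the angle between the plane and the horizontal reference plane falls within the 80 to 100 degree band, i.e. when its normal is nearly perpendicular to the vertical axis:

```python
# Illustrative sketch: decide whether a plane, given by its normal vector in
# the camera or world frame, is "vertical" within the 80-100 degree tolerance.
# The up axis (0, 1, 0) is an assumption tied to the chosen coordinate frame.
import numpy as np


def is_vertical_plane(plane_normal, up=(0.0, 1.0, 0.0), tolerance_deg=10.0):
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    u = np.asarray(up, dtype=float)
    u = u / np.linalg.norm(u)
    # The angle between the plane and the horizontal reference plane equals
    # the angle between their normals; for a vertical plane it is near 90 deg.
    angle = np.degrees(np.arccos(np.clip(abs(np.dot(n, u)), 0.0, 1.0)))
    return abs(angle - 90.0) <= tolerance_deg
```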
Methods for determining whether the movements of features are lying in a vertical plane will be described with reference to
Once the movements of features in a vertical plane are determined, these features are to be identified and related back to their respective image parts of the captured images of the source. This is done in steps 400 and 500. These image parts will then accordingly be identified or marked as being virtual information, which can be filtered out, if appropriate.
a-b show more detailed embodiments for extracting the movements of features. In a first stage 201 and 202, features are detected and extracted in the two images I0t0 and I0ti. Features can relate to objects, but also to more abstract items such as corners, lines, gradients, or more complex features such as those provided by algorithms such as the scale-invariant feature transform, abbreviated as SIFT, algorithm. Feature extraction can be done using standard methods such as a Canny edge detector, a corner detector, or the previously mentioned SIFT method. As both images I0t0 and I0ti come from the same sequence provided by a single source recording the same scene, it is possible to detect movements by identifying similar or matching features in both images. It is however also possible (not shown on these figures) to detect features in only one of the images, and then to determine the movement of these features in the traditional way of determining motion vectors for all pixels belonging to the detected feature of this image, using conventional block matching techniques for determining motion vectors between pixels or macroblocks.
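A non-limiting sketch of the first, matching-based variant is given below (OpenCV's SIFT detector, a brute-force matcher and Lowe's ratio test are assumed; any comparable detector and matcher could be substituted). Each accepted match between I0t0 and I0ti yields one motion vector, defined by its begin and end point:

```python
# Illustrative sketch: match SIFT features between images I0t0 and I0ti of the
# same sequence; each accepted match yields one motion vector (start, end).
import cv2
import numpy as np


def feature_motion_vectors(img_t0, img_ti, ratio=0.75):
    sift = cv2.SIFT_create()
    gray_t0 = cv2.cvtColor(img_t0, cv2.COLOR_BGR2GRAY)
    gray_ti = cv2.cvtColor(img_ti, cv2.COLOR_BGR2GRAY)
    kp0, des0 = sift.detectAndCompute(gray_t0, None)
    kpi, desi = sift.detectAndCompute(gray_ti, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    vectors = []
    for m, n in matcher.knnMatch(des0, desi, k=2):
        if m.distance < ratio * n.distance:          # Lowe's ratio test
            start = np.array(kp0[m.queryIdx].pt)     # feature position at t0
            end = np.array(kpi[m.trainIdx].pt)       # feature position at ti
            vectors.append((start, end))
    return vectors
```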
In the embodiments depicted in
On
The result of this optional filtering step is a set of motion vectors which are representative of meaningful movements, thus lying above a certain noise threshold. These motion vectors can be provided as such, as is the case in
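A minimal sketch of such an optional noise filtering step is shown below; the threshold value in pixels is an assumption and not prescribed by the invention:

```python
# Illustrative sketch: discard motion vectors whose length stays below a noise
# threshold (in pixels), keeping only meaningful movements.
import numpy as np


def filter_noise(motion_vectors, min_length_px=2.0):
    """motion_vectors: iterable of (start, end) 2D point pairs."""
    kept = []
    for start, end in motion_vectors:
        if np.linalg.norm(np.asarray(end) - np.asarray(start)) >= min_length_px:
            kept.append((start, end))
    return kept
```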
In a next stage, the thus detected movements of features, or alternatively features related to movements of features, are then to undergo a check for determining whether they pertain to movements in a vertical plane.
Alternatively this second sequence can be provided externally, e.g. from a composing application which is adapted to create the virtual sequence to be projected on the vertical screen. The output of this composing application may be provided to the arrangement as the source of the contents to be displayed on the screen, and thus contains only the virtual information, e.g. a virtual scene of all people meeting together in one large meeting room. From this sequence containing only virtual information, images at instances t0 and ti are again to be captured, upon which feature extraction and feature movement determination operations are performed. Both identified sets of movements of features are then submitted to a step of determining whether homographical relationships exist between movements of both sets. The presence of a homographical relationship is indicative of the movements belonging to a same plane. In this way several sets of movements, each respective set associated with a respective plane, will be obtained.
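One possible, non-limiting way of obtaining these several planar sets is to apply a RANSAC-based homography estimation repeatedly, removing the inliers of each detected plane before searching for the next one; this sequential peeling strategy and its stopping criteria are assumptions of the sketch below, not features of the invention:

```python
# Illustrative sketch: repeatedly fit a homography between matched feature
# positions of the first sequence (pts1) and of the second sequence (pts2),
# peeling off the inliers of each detected plane to obtain per-plane sets.
import cv2
import numpy as np


def group_movements_by_plane(pts1, pts2, min_inliers=12, max_planes=5):
    pts1 = np.asarray(pts1, dtype=np.float32)
    pts2 = np.asarray(pts2, dtype=np.float32)
    remaining = np.arange(len(pts1))
    planes = []                          # list of (homography, inlier indices)
    while len(remaining) >= min_inliers and len(planes) < max_planes:
        H, mask = cv2.findHomography(pts1[remaining], pts2[remaining],
                                     cv2.RANSAC, 3.0)
        if H is None:
            break
        inlier = mask.ravel().astype(bool)
        if inlier.sum() < min_inliers:
            break
        planes.append((H, remaining[inlier]))
        remaining = remaining[~inlier]
    return planes
```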
The result of this step is thus one or more sets of movements, each set pertaining to movements in a plane. This may be followed by an optional filtering or selection step of selecting only those sets of movements pertaining to a vertical plane, especially for situations where movements in another plane are also to be expected. This may for instance be the case when people are walking in the room, which will also create movements in the horizontal plane of the floor.
In some embodiments the orientation of the plane relative to the camera, which may be supposed to be horizontally positioned, thus representing a reference horizontal plane, can be calculated from the homography by means of homography decomposition methods which are known to a person skilled in the art and are for instance disclosed in http://hal.archives-ouvertes.fr/docs/00/17/47/39/PDF/RR-6303.pdf. These techniques can then be used for selecting the vertical movements from the group of all movements in a plane.
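A hedged sketch of such a decomposition is given below, assuming a calibrated camera with intrinsic matrix K and OpenCV's decomposeHomographyMat function, which returns up to four candidate solutions; the resulting normals can then be tested for verticality, e.g. with a tolerance check as sketched earlier:

```python
# Illustrative sketch: recover candidate plane normals from a homography H,
# given the camera intrinsic matrix K (3x3). The ambiguity between the up to
# four returned solutions must still be resolved, e.g. using visibility
# constraints or prior scene knowledge.
import cv2
import numpy as np


def candidate_plane_normals(H, K):
    """Return unit normals of the candidate plane interpretations of H."""
    _, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)
    return [n.ravel() / np.linalg.norm(n) for n in normals]
```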
Upon determination of the vertical movements, the features to which they relate are again determined, followed by their mapping onto the respective parts in the images I0t0 and I0ti, which image parts are then to be identified as pertaining to virtual information.
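As a purely illustrative sketch of this mapping step (the convex hull around the feature positions is one simple, assumed way of delimiting the corresponding image parts; morphological or segmentation-based alternatives are equally conceivable), a binary mask marking the virtual image parts could be built as follows:

```python
# Illustrative sketch: mark the image region spanned by the feature positions
# of the vertical movements as "virtual", using the convex hull of the points.
import cv2
import numpy as np


def virtual_info_mask(image_shape, feature_points):
    """image_shape: (height, width); feature_points: (N, 2) x, y coordinates."""
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    pts = np.asarray(feature_points, dtype=np.int32).reshape(-1, 1, 2)
    if len(pts) >= 3:
        hull = cv2.convexHull(pts)
        cv2.fillConvexPoly(mask, hull, 255)   # 255 marks virtual image parts
    return mask
```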
In case of an embodiment using a second camera or source recording the same scene, the identified vertical movements may also be related back to features and image parts in images I1t0 and I1ti.
While the principles of the invention have been described above in connection with specific apparatus, it is to be clearly understood that this description is made only by way of example and not as a limitation on the scope of the invention, as defined in the appended claims.
Number | Date | Country | Kind
--- | --- | --- | ---
10306088.5 | Oct 2010 | EP | regional

Filing Document | Filing Date | Country | Kind | 371(c) Date
--- | --- | --- | --- | ---
PCT/EP11/67210 | 10/3/2011 | WO | 00 | 3/12/2013