The present invention concerns the combination of real and virtual images, also known as augmented reality, and more particularly a method and devices for the real time insertion of virtual objects into a representation of a real scene using position and orientation data obtained from that scene.
The mirror effect using a camera and a display screen is employed in numerous applications, in particular in the field of video games. The principle of this technology consists in acquiring an image from a webcam type camera connected to a computer or a console. This image is preferably stored in the memory of the system to which the camera is connected. An object tracking algorithm, also known as a blob tracking algorithm, is used to calculate in real time the contours of certain elements such as the head and the hands of the user. The position of these shapes in the image is used to modify or deform certain parts of the displayed image. This solution enables an area of the image to be located with two degrees of freedom.
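By way of nonlimiting illustration, blob tracking of this kind can be sketched as follows using the OpenCV library in Python; the camera index, the colour range used for segmentation and the number of blobs retained are illustrative assumptions and not part of the description itself.

```python
import cv2
import numpy as np

# Minimal blob-tracking loop: grab webcam frames, segment skin-like regions,
# and report the bounding boxes of the largest blobs (e.g. head and hands).
cap = cv2.VideoCapture(0)  # default webcam; the index is an assumption

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Illustrative skin-tone range in HSV; a real application calibrates this.
    mask = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 180, 255]))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Keep the three largest blobs, assumed to be the head and the two hands.
    blobs = sorted(contours, key=cv2.contourArea, reverse=True)[:3]
    for c in blobs:
        x, y, w, h = cv2.boundingRect(c)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("mirror", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```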
One solution for determining the position and the orientation with which a virtual object must be inserted into an image representing a real scene is to indicate in the real scene the position and the orientation of the virtual object. A sphere can be used for this. The size of the sphere must be sufficient to enable its position to be calculated in a three-dimensional space according to its position in a two-dimensional representation of that space and according to its apparent diameter. The orientation of the sphere can be evaluated by placing colored patches on its surface. This solution is effective if the sphere is of sufficiently large size and if the image capture system is of sufficiently good quality, which restricts the possibilities of movement of the user, in particular fast movements.
However, these solutions do not offer the performance required for many applications and there exists a requirement to improve the performance of such systems at the same time as keeping their cost at an acceptable level.
The invention solves at least one of the problems stated above.
Thus the invention consists in a method for inserting in real time in at least one image, called the first image, of a stream of images representing a real scene at least one image, called the second image, extracted from at least one three-dimensional representation of at least one virtual object, this method being characterized in that it includes the following steps:
Thus the method of the invention determines accurately and in real time the position at which the virtual object or objects must be inserted and the orientation with which the virtual object or objects must be represented. The position and the orientation of the virtual objects are defined with six degrees of freedom. The calculation time and accuracy cater for augmented reality applications such as video games in which the gestures of users are tracked even if the users move quickly. This solution allows great freedom of movement.
In one particular embodiment, at least some of said orientation data is received from an angular sensor present in said real scene.
Again in one particular embodiment, at least some of said position data is received from a position sensor present in said real scene.
Alternatively, in another particular embodiment, a portion of said position and orientation data is received from a sensor present in said real scene and another portion of said position and orientation data is extracted from said acquired first image.
At least some of said position and orientation data is advantageously extracted from said acquired first image from a singular geometrical element associated with said sensor, enabling accurate location of the place where the virtual object or objects must be inserted.
Again in one particular embodiment, the method further includes the following steps:
These steps improve location of the singular geometrical element in the image from the image stream. The position of said singular element in the real scene is advantageously determined from the position of said singular element in said first image and from the apparent size of said singular element in said first image.
In one particular embodiment, the method further includes a step of estimating said position of said virtual object. Comparing the position that has been estimated and the position that has been determined increases the accuracy of the position at which the virtual object or objects must be inserted. Said step of estimation of said position of said virtual object preferably uses a low-pass filter.
The invention also consists in a computer program comprising instructions adapted to execute each of the steps of the method described above.
The invention further consists in removable or non-removable information storage means partly or completely readable by a computer or a microprocessor and containing code instructions of a computer program for executing each of the steps of the method described above.
The invention also consists in an augmented reality device that can be connected to at least one video camera and to at least one display screen, said device including means adapted to execute each of the steps of the method described above.
The invention also consists in a device for inserting in real time in at least one image, called the first image, of a stream of images representing a real scene at least one image, called the second image, extracted from at least one three-dimensional representation of at least one virtual object, this device being characterized in that it includes:
The device of the invention therefore determines accurately and in real time the position at which the virtual object(s) must be inserted and the orientation in which the virtual object(s) must be represented, with six degrees of freedom, for augmented reality applications.
One particular embodiment of the device further includes means for extracting at least some of said position and orientation data from said acquired first image.
The invention further consists in a device comprising at least one visible singular geometrical element and one sensor adapted to transmit position and/or orientation information to the device described above.
Being reliable and economic, such a device can be used for private applications.
Other advantages, aims and features of the present invention emerge from the following detailed description, which is given by way of nonlimiting example and with reference to the appended drawings, in which:
According to the invention, the data relating to the position and/or the orientation of the virtual object to be inserted into a representation of a real scene is obtained at least in part from a sensor situated in the real scene.
In a first embodiment the sensor 135 is a sensor for determining a position and an orientation in the real scene 120 with six degrees of freedom (X, Y, Z, bearing, pitch, roll). For example, this sensor can be a Fastrack sensor from Polhemus (Fastrack is a registered trade mark). In a second embodiment, the sensor 135 is a sensor for determining an orientation (bearing, pitch, roll) with three degrees of freedom, the position (X, Y, Z) of the sensor being determined by visual analysis of images from the camera 125.
Thus the system 100 consists of the following elements:
The computer contains an augmented reality application such as the D'Fusion software from Total Immersion (D'Fusion is a trade mark of the company Total Immersion) for generating an interactive augmented reality scene using the following functions, for example:
The principle of this type of application is described in patent application WO 2004/012445.
Thus the D'Fusion software displays the synthetic objects in real time and with the position and orientation that have been determined. The user can also interact with other virtual objects inserted into the video stream.
The equipment 200 preferably includes a communication bus 202 to which are connected:
The equipment 200 can optionally also include the following elements:
The communication bus enables communication between and interworking of the various elements included in or connected to the equipment 200. The representation of the bus is not limiting on the invention and in particular the central processing unit is able to communicate instructions to any element of the equipment 200 either directly or via another element of the equipment 200.
The executable code of each program enabling the programmable equipment to implement the method of the invention can be stored on the hard disk 220 or in the read-only memory 206, for example.
Alternatively, the executable code of the programs could be received from the communication network 228 via the interface 226, to be stored in exactly the same way as described above.
The memory cards can be replaced by any information medium such as a compact disk (CD-ROM or DVD), for example. Generally speaking, memory cards can be replaced by information storage means readable by a computer or by a microprocessor, possibly integrated into the equipment, possibly removable, and adapted to store one or more programs the execution of which implements the method of the invention.
More generally, the program or programs could be loaded into one of the storage means of the equipment 200 before being executed.
The central processing unit 204 controls and directs execution of the instructions or software code portions of the program or programs of the invention, which are stored on the hard disk 220, in the read-only memory 206 or in the other storage elements cited above. On powering up, the program or programs stored in a nonvolatile memory, for example on the hard disk 220 or in the read-only memory 206, are transferred into the random-access memory 208, which then contains the executable code of the program or programs of the invention and registers that are used for storing the variables and parameters necessary for implementing the invention.
It should be noted that the communication equipment including the device of the invention can also take the form of programmed equipment. It then contains the code of the computer program or programs, for example in an application-specific integrated circuit (ASIC).
The computer 130′ includes a video acquisition card 210 connected to the video camera 125, a graphics card 216 connected to the screen 115, a first communication port (COM1) 214-1 connected to the position and orientation sensor of the handgrip 300 via the unit 305, and a second communication port (COM2) 214-2 connected to a trigger switch of the handgrip 300, preferably via the unit 305. A trigger switch is a switch for opening or closing the corresponding electrical circuit quickly when pressure is applied to the trigger. Use of such a switch enhances interactivity between the user and the software of the computer 130′. The switch 310 is used to simulate firing in a game program, for example. The video acquisition card is a Decklinck PCIe card, for example. The graphics card is a 3D graphics card enabling insertion of synthetic images into a video stream, for example an ATI X1800XL card or an ATI 1900XT card. Although the example shown uses two communication ports (COM1 and COM2), it must be understood that other communication interfaces can be used between the computer 130′ and the handgrip 300.
The computer 130′ advantageously includes a sound card 320 connected to loudspeakers (LS) integrated into the screen 115. The connection between the video acquisition card 210 and the video camera 125 can conform to any of the following standards: composite video, S-Video, HDMI, YUV, YUV-HD, SDI, HD-SDI or USB/USB2. Likewise, the connection between the graphics card 216 and the screen 115 can conform to one of the following standards: composite video, S-Video, YUV, YUV-HD, SDI, HD-SDI, HDMI, VGA. The connection between the communication ports 214-1 and 214-2, the sensor and the trigger switch of the handgrip 300 can be of the RS-232 type. The computer 130′ is, for example, a standard PC equipped with a 3 GHz Intel Pentium IV processor, 3 Gbytes of RAM, a 120 Gbyte hard disk and two PCI Express (Peripheral Component Interconnect Express) interfaces for the acquisition cards and the graphics card.
The handgrip 300 preferably includes a position and orientation sensor 135′ with six degrees of freedom, for example a “Fastrack” sensor from Polhemus, and a trigger switch 310. An example of the handgrip 300 is shown in
The unit 305 constitutes an interface between the handgrip 300 and the computer 130′. The function of the unit 305, which is associated with the position and orientation sensor, is to transform signals from the sensor into data that can be processed by the computer 130′. The unit 305 includes a movement capture module 315 and advantageously includes a sender enabling wireless transmission of signals from the sensor 135′ to the unit 305.
Alternatively, the electrical wiring harness is dispensed with and a wireless communication module inserted into the handgrip 300. Data from the sensor 135′ is then transmitted to the unit 305 with no cable connection. The trigger switch is then inactive unless it is also coupled to a wireless communication module.
In one particular embodiment a background of uniform color, for example a blue background or a green background, is placed behind the user. This uniform background is used by the software to “clip” the user, i.e. to extract the user from the images coming from the video camera 125 so that the user can be embedded in a synthetic scene or in a secondary video stream. To insert the user into a synthetic scene, the D'Fusion software uses its chromakey capability (for embedding a second image in a first image according to a color identified in the first image) in real time, using a pixel shader function to process the video stream from a camera.
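By way of nonlimiting illustration, an equivalent chromakey composition can be sketched as follows on the CPU with OpenCV; this sketch is not the D'Fusion pixel shader itself, and the hue bounds of the backdrop used in the example are illustrative assumptions to be calibrated in practice.

```python
import cv2
import numpy as np

def chroma_key(camera_frame, background, lower_hsv, upper_hsv):
    """Replace the uniform-colour backdrop in camera_frame by background.

    camera_frame, background: BGR images of identical size.
    lower_hsv, upper_hsv: bounds of the backdrop colour (illustrative values below).
    """
    hsv = cv2.cvtColor(camera_frame, cv2.COLOR_BGR2HSV)
    backdrop = cv2.inRange(hsv, lower_hsv, upper_hsv)      # 255 where backdrop colour
    user = cv2.bitwise_not(backdrop)                       # 255 where the user is
    fg = cv2.bitwise_and(camera_frame, camera_frame, mask=user)
    bg = cv2.bitwise_and(background, background, mask=backdrop)
    return cv2.add(fg, bg)

# Example: key out a green backdrop (range is an assumption):
# composite = chroma_key(frame, synthetic_scene,
#                        np.array([45, 80, 80]), np.array([75, 255, 255]))
```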
Although the device described above with reference to the first embodiment is entirely satisfactory in terms of the result, the cost of the position and orientation sensor with six degrees of freedom may be prohibitive for personal use. To overcome this drawback, the second embodiment is based on the use of a low-cost movement sensor combined with an image processing module for obtaining the position and the orientation of the sensor with six degrees of freedom.
The handgrip 600 is advantageously connected to the computer 130″ wirelessly, with no movement capture unit. The handgrip 600 includes an orientation sensor 135″ capable of determining the orientation of the handgrip 600 with three degrees of freedom. The orientation sensor 135″ is the MT9B angular sensor from Xsens, for example, or the Inertia Cube 3 angular sensor from Intersense, in either the cable or the wireless version. The orientation data from the sensor can be transmitted via a COM port or using a specific wireless protocol. One or more trigger switches 310 are preferably included in the handgrip 600. The handgrip 600 also includes a geometrical element having a particular shape for locating the handgrip 600 when it is visible in an image. This geometrical shape is a colored sphere with a diameter of a few centimeters, for example. However, other shapes can be used, in particular a cube, a plane or a polyhedron. To enable coherent positioning of the angular sensor, the handgrip 600 is preferably of cranked shape to oblige the user to hold it in a predetermined way (with the fingers positioned according to the cranked shape).
As shown, the handgrip 600 includes a lower part, also called the butt, adapted to be held by the user, inside which are a battery 700, for example a lithium battery, a wireless communication module 705 and the trigger switches 310. The angular sensor 135″ is preferably fixed to the upper part of the butt, which advantageously includes a screwthread at its perimeter for mounting the geometrical element used to identify the position of the handgrip. In this example, the geometrical element is a sphere 615, which includes an opening adapted to be screwed onto the upper part of the butt. It must be understood that other means can be used for fixing the geometrical element to the butt, such as gluing or nesting means. A light source 710 such as a bulb or a LED (light-emitting diode) is advantageously disposed inside the sphere 615, which is preferably made from a transparent or translucent material. This light source and all the electrical components of the handgrip 600 are activated at the command of the user or, for example, as soon as the handgrip 600 is detached from the support 715 that is used to store the handgrip and, advantageously, to charge the battery 700. In this case, the support 715 is connected to an electrical supply.
b shows the electrical circuit diagram of a first arrangement of the electrical components of the handgrip 600. The battery 700 is connected to the orientation sensor 135″, the trigger switch 310, the wireless transmission module 705 and the light source 710 to supply them with electrical power. A switch 720 is advantageously disposed at the output of the battery 700 for cutting off or activating the supply of electrical power to the orientation sensor 135″, the trigger switch 310, the wireless transmission module 705 and the light source 710. The switch 720 can be operated manually by the user or automatically, for example when the handgrip 600 is taken out of the support 715. It is also possible to modify how the battery 700 is connected so that the switch controls only some of the aforementioned components. It is equally possible to use a number of switches to control these components independently, for example so that the handgrip 600 can be used without activating the light source 710.
The orientation sensor 135″ and the trigger switch 310 are connected to the wireless transmission module 705 to transfer information from the sensor 135″ and the trigger switch 310 to the computer 130″. The wireless transmission module 705 is for example a radio-frequency (RF) module such as a Bluetooth or WiFi module. A corresponding wireless communication module is connected to the computer 130″ to receive signals sent by the handgrip 600. This module can be connected to the computer 130″ using a USB/USB2 or RS-232 interface, for example.
Alternatively, if a cable connection is used between the handgrip 600 and the computer 130″, the handgrip 600 requires neither the wireless communication module 705 nor the battery 700, as the computer can supply the handgrip 600 with electricity. This alternative is shown in
The handgrip 600 has a cranked shape to avoid uncertainty as to the orientation of the handgrip relative to the sensor. While allowing the user a great range of movement, the geometrical element used to determine the position of the handgrip remains easily visible in an image. This geometrical element can easily be removed, for example to change its color or its shape. Finally, the presence of a light source in the geometrical element or on its surface improves the tracking of this object under poor lighting conditions.
To determine the position of the handgrip, the computer 130″ analyzes the images from the video camera or the webcam 125 in which the geometrical element of the handgrip 600 is present. In this step it is essential to find accurately the position of the center of the geometrical element in the images coming from the camera. The solution adopted is to use a new color space and a new filtering approach to enhance the quality of the results obtained.
To increase the accuracy of the results obtained, it is preferable to estimate the sought position theoretically (step 825) by linear extrapolation of the position of the geometrical element over time and to compare the estimated position with the position obtained by image analysis (step 830).
The principle of determining the position of the geometrical element consists firstly in detecting areas of the image whose color corresponds to that of the geometrical element. To overcome brightness variations, the HSL color space is preferable to the RGB color space.
After converting an RGB image into an HSL image, all the pixels are selected and the pixels for which the luminance L is not in a predefined range [θLinf;θLsup] are deselected. A pixel can be deselected by imposing zero values for the luminance L, the saturation S and the hue H, for example. Thus all the pixels selected have non-zero values and the deselected pixels have a zero value.
Segmenting an image using an HSL color space gives results that are not entirely satisfactory because a very dark or very light pixel (but not one that is black and not one that is white) can have virtually any hue value (which value can change rapidly because of noise generated in the image during acquisition), and thus have a hue close to that sought. To avoid this drawback, the HSL color space is modified so as to ignore pixels that are too dark or too light. For this purpose, a new saturation S′ is created. The saturation S′ is derived from the saturation S by applying a weighting coefficient α linked to the luminance L by the following equation: S′=αS. The weighting coefficient α preferably has a value between 0 and 1.
Thus pixels whose saturation S′ is not above a predefined threshold θS′inf are deselected. Likewise, pixels whose hue H does not correspond to the hue of the geometrical element, i.e. pixels not in a predetermined range [θHinf;θHsup] depending upon the hue of the geometrical element, are deselected. It should be noted that the hue is in theory expressed in degrees, varying from 0 to 360°. In fact, hue is a cyclic concept, with “red” at both ends (0 and 360). From a practical point of view, as 360 cannot be coded on one byte, the hue value is recoded, depending on the target application, over the range [0,180[, [0,240[ or [0,255]. To optimize the calculation cost, the range [0,180[ is preferable. It should nevertheless be noted that the loss of precision caused by this change of scale has no significant effect on the results.
Pixels are preferably deselected in the order luminance L, saturation S′ and then hue H. However, the essential phase is segmentation according to the hue H. Segmentation according to the luminance and the saturation enhances the quality of the results and the overall performance, in particular because it optimizes the calculation time.
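By way of nonlimiting illustration, this pixel-selection stage can be sketched as follows with OpenCV, working in its HLS representation (whose 8-bit hue already lies in the range [0,180[). The weighting profile α(L) and the threshold values are illustrative assumptions; the description requires only that α lie between 0 and 1 and be linked to the luminance L.

```python
import cv2
import numpy as np

def select_pixels(frame_bgr, h_range, l_range, s_prime_min):
    """Return a mask of pixels retained by the luminance / S' / hue tests.

    h_range     = (θ_Hinf, θ_Hsup)  hue range of the geometrical element (0..179 in OpenCV)
    l_range     = (θ_Linf, θ_Lsup)  admissible luminance range
    s_prime_min = θ_S'inf           minimum weighted saturation
    The weighting α(L) below (a triangular profile that vanishes for very dark
    and very light pixels) is an assumption made for this sketch.
    """
    hls = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HLS)
    H, L, S = cv2.split(hls)

    alpha = 1.0 - np.abs(L.astype(np.float32) - 127.5) / 127.5   # assumed α(L)
    s_prime = alpha * S.astype(np.float32)                        # S' = αS

    keep = (L >= l_range[0]) & (L <= l_range[1])                  # luminance test
    keep &= s_prime > s_prime_min                                 # saturation S' test
    keep &= (H >= h_range[0]) & (H <= h_range[1])                 # hue test
    return keep.astype(np.uint8) * 255

# Example thresholds for a green sphere (illustrative values):
# mask = select_pixels(frame, h_range=(45, 75), l_range=(40, 215), s_prime_min=30)
```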
Some of the selected pixels of the image represent the geometrical element. To identify them, a contour extraction step is used. This consists in extracting the contours of the non-zero pixel groups using a convolution mask, for example. It should be noted here that there are numerous contour extraction algorithms.
It is then necessary to determine which of the contours most closely approximates the shape of the geometrical element used, which here is a circular contour, the geometrical element used being a sphere.
All contours whose size in pixels is too small to represent a circle of usable size are deselected. This selection is effected according to a predetermined threshold θT. Likewise, all contours whose area in pixels is too low are eliminated. This selection is again effected according to a predetermined threshold θA.
The minimum radius of a circle enclosing the contour is then calculated for each of the remaining contours, after which the ratio between the area determined by the contour and the calculated radius is evaluated. The required contour is that yielding the highest ratio. This ratio reflects the fact that the contour fills the circle that surrounds it to the maximum and thus gives preference simultaneously to contours that tend to be circular and contours of greater radius. This criterion has the advantage of a relatively low calculation cost. The selection criterion must naturally be adapted to the shape of the geometrical element.
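By way of nonlimiting illustration, this contour-selection criterion can be sketched as follows; the function name and the threshold values in the usage example are illustrative assumptions.

```python
import cv2

def select_sphere_contour(contours, min_perimeter, min_area):
    """Pick the contour most likely to be the projected sphere.

    Contours that are too small (perimeter below θ_T or area below θ_A) are
    discarded; among the rest, the contour maximising the ratio
    area / enclosing-circle radius is kept, as described above.
    """
    best, best_ratio = None, 0.0
    for c in contours:
        if cv2.arcLength(c, True) < min_perimeter:     # size test, threshold θ_T
            continue
        area = cv2.contourArea(c)
        if area < min_area:                            # area test, threshold θ_A
            continue
        (u, v), radius = cv2.minEnclosingCircle(c)     # minimum enclosing circle
        if radius <= 0:
            continue
        ratio = area / radius                          # favours filled, large circles
        if ratio > best_ratio:
            best, best_ratio = ((u, v), radius), ratio
    return best   # ((u, v), apparent radius in pixels) or None

# contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# circle = select_sphere_contour(contours, min_perimeter=30.0, min_area=100.0)
```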
Colorimetric and geometric segmentation yield a circle in the image approximately representing the projection of the sphere associated with the handgrip. An advantage of this solution is that if the shape and the color of the geometrical element are unique in the environment, then recognition of that shape is robust in the face of partial occlusion.
The position of the geometrical element in the space in which it is located is then determined from its projection in the image. To simplify the calculations it is assumed here that the geometrical element is situated on the optical axis of the camera producing the image. In reality, the projection of a sphere generally gives an ellipse. A circle is obtained only if the sphere is on the optical axis. Such an approximation is nevertheless sufficient to determine the position of the sphere from its apparent radius thanks to a simple ratio of proportionality.
It should furthermore be noted that the ratio R/Z, where R is the real radius of the sphere and Z its distance from the camera along the optical axis, is equal to the ratio rp/fp, where fp is the focal distance in pixels and rp is the apparent radius of the sphere in pixels. From this the following equation is deduced:

Z = fp·R/rp
The projection of a point with coordinates (x, y, z), in a system of axes the origin of which is the camera, into coordinates (u, v) in the image taken by the camera is expressed as follows:

u = px + fp·x/z
v = py + fp·y/z

where (px, py) is the position in pixels of the optical center in the image. This equation is used to deduce the real coordinates X and Y of the sphere when its real coordinate Z and its coordinates u and v in the image, all in pixels, are known:

X = (u − px)·Z/fp
Y = (v − py)·Z/fp
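By way of nonlimiting illustration, these relations can be applied as in the following sketch; the function name and the numerical values in the usage example are illustrative assumptions.

```python
def sphere_position(u, v, r_p, real_radius, f_p, p_x, p_y):
    """Deduce the (X, Y, Z) position of the sphere in the camera frame.

    u, v        : centre of the detected circle, in pixels
    r_p         : apparent radius of the sphere, in pixels
    real_radius : real radius R of the sphere (e.g. in metres)
    f_p         : focal distance expressed in pixels
    p_x, p_y    : optical centre of the image, in pixels
    Uses the on-axis approximation described above (the projected ellipse is
    treated as a circle).
    """
    Z = f_p * real_radius / r_p          # from R / Z = r_p / f_p
    X = (u - p_x) * Z / f_p              # inverts u = p_x + f_p * X / Z
    Y = (v - p_y) * Z / f_p              # inverts v = p_y + f_p * Y / Z
    return X, Y, Z

# Example (illustrative numbers): a sphere of 3 cm radius seen with r_p = 20 px,
# f_p = 800 px and image centre (320, 240):
# X, Y, Z = sphere_position(350, 250, 20.0, 0.03, 800.0, 320.0, 240.0)
```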
It is important to note that the quality of the estimated radius of the sphere has a great impact on the quality of the Z position, which in turn affects the quality of the X and Y positions (themselves also affected by the quality of the estimated 2D position of the circle). An error in Z may be large both metrically and visually: the virtual object associated with the sensor is generally larger than the sphere, so an error that overestimates the radius amplifies the apparent size of the virtual object inserted in the image in proportion to how much larger (metrically) that object is than the sphere.
A serious problem in seeking the real position of the geometrical element stems from the lack of stability over time of the estimated position (u, v) and the estimated radius of the circle. This problem is reflected in significant vibration of the X, Y and Z position of the virtual object associated with the sensor. Special filtering is used to filter out these vibrations.
This special filtering is based on the principle that a prediction can be produced by low-pass filtering and that the filtered value is applied if the prediction is fairly close to the new measurement. As soon as the prediction departs from the measurement, a “wait” phase verifies whether the error exists only in an isolated image from the video stream or is confirmed over time; during this phase the filtered value resulting from the prediction process continues to be applied. If the first error is confirmed, the real value of the measurement is applied with a delay of one image in the video stream. Low-pass filtering is applied to the last n measurements (excluding those considered abnormal) using orthogonal linear regression (orthogonal quadratic regression gives results of lower quality). The value of n is variable, increasing up to a predetermined threshold as long as the predictions remain consistent with the measurements. As soon as a prediction is no longer consistent, following a variation confirmed by the next image, the value of n drops to 4 for minimum filtering. This technique makes the filtering more responsive and is based on the principle that the vibrations are more visible when the radius is deemed to be fairly constant; in contrast, the vibrations are not very perceptible during movement, so the latency can be reduced.
The following equations show the detailed linear orthogonal regression calculation using a straight line with the equation y=ax+b, where x corresponds to the index of the current frame and y to the value of one of the three parameters u, v and the apparent radius of the sphere, each of which is filtered independently.
The error between the linear orthogonal regression and the measurement at the point pi(xi, yi) can be expressed in the form:

ei = (a·xi + b) − yi

It is thus necessary to minimize the total quadratic error E, which can be expressed as follows:

E = Σ ei² = Σ ((a·xi + b) − yi)²

by setting:

sx = Σ xi, sy = Σ yi, sx2 = Σ xi², sxy = Σ xi·yi and sy2 = Σ yi² (the sums being taken over the n measurement points)

as a result of which:

E = a²·sx2 + 2ab·sx + b²·n − 2a·sxy − 2b·sy + sy2

The function E being a quadratic function, it takes its minimum value when its partial derivatives with respect to a and b are zero:

∂E/∂a = 2a·sx2 + 2b·sx − 2·sxy = 0
∂E/∂b = 2a·sx + 2b·n − 2·sy = 0

Consequently:

a = (n·sxy − sx·sy)/det and b = (sx2·sy − sx·sxy)/det

where det = n·sx2 − sx².
For each image from the video stream from the camera, the values a and b are estimated in order to predict a value for the coordinates (u, v) and for the apparent radius of the sphere, from which the coordinates (x, y, z) of the sphere in the real scene are deduced. These estimated values are used as a reference and compared to the values determined by image analysis as described above. Depending on the result of the comparison, the values determined by image analysis are used instead of the predicted values or not.
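By way of nonlimiting illustration, this prediction and filtering process can be sketched as follows, one filter being applied per parameter (u, v and the apparent radius). The regression uses the closed-form expressions for a and b given above; the tolerance, the maximum window size and the class name are illustrative assumptions.

```python
class PredictiveFilter:
    """Filter one parameter (u, v or apparent radius) as described above."""

    def __init__(self, tolerance, n_max=25):
        self.tolerance = tolerance     # maximum accepted prediction error (assumption)
        self.n_max = n_max             # predetermined upper bound on n (assumption)
        self.history = []              # accepted (frame_index, value) pairs
        self.pending = None            # measurement awaiting confirmation

    def _predict(self, x):
        """Fit y = a*x + b over the window and evaluate it at frame x."""
        pts = self.history
        n = len(pts)
        if n < 2:
            return pts[-1][1] if pts else None
        sx = sum(p[0] for p in pts)
        sy = sum(p[1] for p in pts)
        sx2 = sum(p[0] * p[0] for p in pts)
        sxy = sum(p[0] * p[1] for p in pts)
        det = n * sx2 - sx * sx
        a = (n * sxy - sx * sy) / det
        b = (sx2 * sy - sx * sxy) / det
        return a * x + b

    def update(self, x, measured):
        """Return the filtered value for the measurement taken at frame x."""
        predicted = self._predict(x)
        if predicted is None or abs(measured - predicted) <= self.tolerance:
            # Prediction close to the measurement: apply the filtered value
            # and let the window grow up to n_max.
            self.pending = None
            self.history.append((x, measured))
            if len(self.history) > self.n_max:
                self.history.pop(0)
            return measured if predicted is None else predicted
        if self.pending is None:
            # First discrepancy: "wait" phase, keep applying the prediction.
            self.pending = (x, measured)
            return predicted
        # Discrepancy confirmed by the next image: adopt the measurement
        # (one image late) and drop the window to its minimum size (n = 4).
        self.history = (self.history[-2:] + [self.pending, (x, measured)])[-4:]
        self.pending = None
        return measured

# Usage, one filter per parameter and per frame:
# u_f = filter_u.update(frame_index, u_measured)
```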
When the position and the orientation of the virtual object in the real scene have been determined, the augmented reality software, for example the D'Fusion software, determines the image of the virtual object to be inserted from the three-dimensional model of that object. This image of the virtual object is thereafter inserted into the image of the real scene.
The process of determining the position and the orientation of the virtual object in the real scene, determining the image of the virtual object and inserting the image of the virtual object in an image of the real scene is repeated for each image in the video stream from the camera.
The augmented reality software can also be coupled to a game, thus enabling users to see themselves “in” the game.
Naturally, a person skilled in the field of the invention could make modifications to the foregoing description to satisfy specific requirements. In particular, it is not imperative to use a sensor present in the real scene and having at least three degrees of freedom. The only constraint is that data from the sensor must complement data from image analysis. For example, this makes it possible to use a sensor having two degrees of freedom and to obtain information linked to the four other degrees of freedom by image analysis. Likewise, the handgrip comprising the position and orientation sensor can take forms other than those described.
Foreign application priority data: application 0752547, filed January 2007, FR (national).
PCT filing data: filing document PCT/FR08/00011, filed 1/3/2008, country WO, kind 00, 371(c) date 11/12/2009.