The present invention relates to systems and methods to provide video signals that include both a person and a virtual object. Some embodiments relate to systems and methods to efficiently and dynamically generate a supplemental video signal to be displayed for the person.
An audio, visual or audio-visual program (e.g., a television broadcast) may include virtual content (e.g., computer generated, holographic, etc.). For example, a sports anchorperson might be seen (from the vantage point of the ‘audience’) evaluating the batting stance of a computer generated baseball player that is not physically present in the studio. Moreover, in some cases, the person may interact with virtual content (e.g., by walking around and pointing to various portions of the baseball player's body). It can be difficult, however, for the person to accurately and naturally interact with virtual content that he or she cannot actually see. This may occur whether or not the studio anchorperson is actually in the final cut of the scene as broadcast to the ‘audience.’ In some cases, a monitor in the studio might display the blended broadcast image (that is, including both the person and the virtual content). With this approach, however, the person may keep glancing at the monitor to determine whether he or she is standing in the right area and/or is looking in the right direction. An anchorperson's difficulty in determining where or how to interact with the virtual image can be distracting to viewers of the broadcast and can detract from the quality of the anchorperson's overall interaction, making the entire scene, including the virtual content, look less believable, as well as more difficult to produce.
Applicants have recognized that there is a need for methods, systems, apparatus, means and computer program products to efficiently and dynamically facilitate interactions between a person and virtual content. For example,
Referring again to
Referring again to
To efficiently and dynamically facilitate interactions between a person and virtual content,
The graphics platform 350 may, according to some embodiments, execute a rendering application, such as the Brainstorm eStudio® three dimensional real-time graphics software package. Note that the graphics platform 350 could be implemented using a Personal Computer (PC) running a Windows® Operating System (“OS”) or an Apple® computing platform, or a cloud-based program (e.g., Google® Chrome®). The graphics platform 350 may use information about the virtual object 330 (e.g., the object's location, motion, appearance, etc.) to create a broadcast or viewer signal that includes images of both the person 320 and the virtual object 330. For example,
Referring again to
At 502, 3D information about a virtual object is received at a graphics platform. For example, a location and dimensions of the virtual object may be determined by the graphics platform. At 504, 3D information associated with a person in a scene may be determined. The 3D information associated with the person might include the person's location, orientation, line of sight, pose, etc., and may be received, for example, from a video camera and/or one or more RTLS sensors using technologies such as RFID, infrared, and Ultra-wideband. Next, the graphics platform may create: (i) a "viewer signal" (possibly a video and/or audio signal) of the scene in relation to the person (whether or not actually including the person), which may include, for example, the virtual element and an animated figure of the person; and (ii) a "supplemental signal" of the scene (e.g., a video and/or audio signal), wherein the viewer signal and the supplemental signal are from different perspectives based at least in part on the 3D information. For example, the viewer signal might represent the scene from the point of view of a video camera filming the scene while the supplemental video signal represents the scene from the person's point of view. According to some embodiments, the supplemental video signal is displayed (or transmitted, e.g., via audio) to the person to help him or her interact with the virtual object. In an embodiment of this invention, the performing person may be a robot.
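By way of a non-limiting illustration, the following Python sketch shows how a graphics platform might derive two view matrices from the received 3D information: one for the viewer signal (the broadcast camera's perspective) and one for the supplemental signal (the person's perspective). The function names (e.g., look_at, render_scene) and the coordinate values are assumed placeholders, not requirements of any embodiment.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 view matrix for an eye position looking at a target point."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    view = np.eye(4)
    view[0, :3], view[1, :3], view[2, :3] = right, true_up, -forward
    view[:3, 3] = -view[:3, :3] @ eye
    return view

# 3D information received at the graphics platform (positions in meters, illustrative only).
virtual_object_pos = np.array([2.0, 0.0, 5.0])    # received at 502
person_eye_pos = np.array([0.5, 1.7, 4.0])        # determined at 504 (e.g., via RTLS sensors)
camera_pos = np.array([0.0, 1.5, 0.0])

viewer_view = look_at(camera_pos, virtual_object_pos)            # broadcast camera perspective
supplemental_view = look_at(person_eye_pos, virtual_object_pos)  # person's perspective
# A rendering engine could now draw the virtual object once per view matrix, e.g.:
# render_scene(virtual_object, viewer_view) and render_scene(virtual_object, supplemental_view)
```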
At 516, location information associated with a spatial relationship between the person and the virtual object may be determined. According to some embodiments, the location information may be determined by sensors or by analyzing the video signal from the camera. Moreover, a plurality of video signals might be received and analyzed by a graphics platform to model the person's appearance and to determine a three dimensional location of the person. Other types of location information may include a distance between the person and the virtual object, one or more angles associated with the person and the virtual object, and/or an orientation of the person (e.g., where he or she is currently looking). Note that other types of RTLS sensors may also be used (e.g., sensors using sound waves or any other way of measuring distance).
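As a simple sketch of the location information described above, the distance between the person and the virtual object, and the angle between the person's line of sight and the direction toward the object, might be computed as follows; the input positions and the orientation vector are assumed to come from the sensors or video analysis described above.

```python
import numpy as np

def spatial_relationship(person_pos, person_facing, object_pos):
    """Distance between a person and a virtual object, and the angle between the
    person's facing direction and the direction toward the object (in degrees)."""
    offset = object_pos - person_pos
    distance = np.linalg.norm(offset)
    direction = offset / distance
    facing = person_facing / np.linalg.norm(person_facing)
    angle = np.degrees(np.arccos(np.clip(np.dot(facing, direction), -1.0, 1.0)))
    return distance, angle

# Example inputs, e.g., from RTLS sensors and an orientation sensor (values assumed):
dist, off_axis = spatial_relationship(
    person_pos=np.array([0.0, 0.0, 0.0]),
    person_facing=np.array([0.0, 0.0, 1.0]),
    object_pos=np.array([1.0, 0.0, 4.0]))
```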
At 518, a supplemental signal may be created based on the location information. In particular, the supplemental signal may include a view of the virtual object, or a perception of the virtual object, as would be seen or perceived from the person's perspective. The perception of the virtual object might comprise a marker (e.g., a dot or "x" indicating where a person should look, or a sound when a person looks in the right direction), an image with a lower resolution as compared with the viewer signal, an image updated at a lower frame rate as compared with the viewer signal, and/or a dynamically generated occlusion zone. According to some embodiments, the supplemental signal is further based on an orientation of the person's line of sight (e.g., the supplemental video signal may be updated when a person turns his or her head). Moreover, multiple people and/or virtual objects may be involved in the scene and/or included in the supplemental signal. In this case, a supplemental signal may be created for each person, and each supplemental signal would include a view or perception of the virtual objects as would be seen or perceived from that person's perspective.
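The following illustrative sketch derives such a supplemental frame from a full-resolution render: it skips frames (lower frame rate), downsamples (lower resolution), and draws a simple marker indicating where the person should look. The parameter names and the marker style are assumptions for illustration only.

```python
import numpy as np

def make_supplemental_frame(render_hd, marker_xy, frame_index, frame_skip=2, scale=2):
    """Derive a lighter-weight supplemental frame from a full render: skip frames,
    downsample, and draw a marker showing where the person should look."""
    if frame_index % frame_skip:
        return None                                  # lower frame rate: reuse previous frame
    frame = render_hd[::scale, ::scale].copy()       # lower resolution than the viewer signal
    x, y = marker_xy[0] // scale, marker_xy[1] // scale
    frame[max(0, y - 3):y + 3, max(0, x - 3):x + 3] = [255, 0, 0]   # simple "look here" dot
    return frame

hd_render = np.zeros((1080, 1920, 3), dtype=np.uint8)               # placeholder render
supplemental = make_supplemental_frame(hd_render, marker_xy=(960, 540), frame_index=0)
```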
The supplemental signal may then be transmitted to a secondary device (e.g., a display device). According to some embodiments, the display device may be worn by the person, such as an eyeglasses display, a retinal display, and/or a contact lens display, or a hearing aid (for rendering sound information). Moreover, according to some embodiments, the supplemental signal is wirelessly transmitted to the secondary device; hence, the supplemental signal and its display to the performing person may be almost transparent to a viewer of the final broadcast.
Moreover, according to some embodiments, a command from the person may be detected and, responsive to said detection, the virtual object may be adjusted. Such a command might comprise, for example, an audible command and/or a body gesture command. For example, a graphics platform might detect that the person has "grabbed" a virtual object and then move the image of the virtual object as the person moves his or her hands. As another example, a person may gesture or verbally order that the motion of a virtual object be paused and/or modified. As another example, a guest or another third person (or group of persons), without access to the devices enabling perception of the virtual image, may gesture or verbally order motion of the virtual object, causing the virtual object to move (and for such movement to be perceived from the perspective of the original person wearing the detection device). For example, when an audience claps or laughs, the sound might cause the virtual object to take a bow, which the person may then be able to perceive via information provided in the supplemental feed.
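A minimal sketch of such gesture-driven adjustment is shown below, assuming an upstream tracker that already classifies gestures (e.g., "grab" and "release") and reports a hand position; those inputs and the class structure are illustrative placeholders.

```python
class VirtualObjectController:
    """Moves a virtual object while the person holds a "grab" gesture."""

    def __init__(self, object_pos):
        self.object_pos = list(object_pos)
        self.grabbed = False

    def on_event(self, gesture, hand_pos):
        if gesture == "grab":
            self.grabbed = True
        elif gesture == "release":
            self.grabbed = False
        if self.grabbed and hand_pos is not None:
            self.object_pos = list(hand_pos)   # the object follows the person's hand
        return self.object_pos

controller = VirtualObjectController(object_pos=(2.0, 0.0, 5.0))
controller.on_event("grab", hand_pos=(1.8, 1.2, 4.5))
controller.on_event(None, hand_pos=(1.5, 1.3, 4.2))   # still grabbed, keeps following
controller.on_event("release", hand_pos=None)
```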
As used herein, the phrases “video feed” and “image” may refer to any signal conveying information about a moving or still image, including audio signals and including a High Definition-Serial Data Interface (“HD-SDI”) signal transmitted in accordance with the Society of Motion Picture and Television Engineers 292M standard. Although HD signals may be described in some examples presented herein, note that embodiments may be associated with any other type of video feed, including a standard broadcast feed and/or a three dimensional image feed. Moreover, video feeds and/or received images might comprise, for example, an HD-SDI signal exchanged through a fiber cable and/or a satellite transmission. Moreover, the video cameras described herein may be any device capable of generating a video feed, such as a Sony® studio (or outside) broadcast camera.
Thus, systems and methods may be provided to improve the production of video presentations involving augmented reality technology. Specifically, some embodiments may produce an improved immersive video mixing subjects and a virtual environment. This might be achieved, for example, by reconstructing the subject's video and/or presenting the subject with a "subject-view" of the virtual environment. This may facilitate interactions between subjects and the virtual elements and, according to some embodiments, let a subject alter a progression and/or appearance of virtual imagery through gestures or audible sounds.
Augmented reality may fuse real scene video with computer generated imagery. In such a fusion, the virtual environment may be rendered from the perspective of a camera or other device that is used to capture the real scene video (or audio). Hence, knowledge of the camera's parameters may be required along with distances of real and virtual objects relative to the camera to resolve occlusion. For example, the image of part of a virtual element may be occluded by the image of a physical element in the scene or vice versa. Another aspect of enhancing video presentation through augmented reality is handling the interaction between the real and the virtual elements.
For example, a sports anchorperson may analyze maneuvers during a game or play segment. In preparation for a show, a producer might request a graphical presentation of a certain play in a game that the anchor wants to analyze. This virtual playbook might comprise a code module that, when executed on a three dimensional rendering engine, may generate a three dimensional rendering of the play. The synthesized play may then be projected from the perspective of the studio camera. To analyze the play, the anchor's video image may be rendered so that he or she appears standing on the court (while actually remaining in a studio) among the virtual players. He or she may then deliver the analysis while virtually engaging with the players. To position himself or herself relative to the virtual players, the anchor typically looks at a camera screen and rehearses the movements beforehand. Even then, it may be a challenge to make the interaction between a real person and a virtual person look natural.
Thus, when one or more persons interact with virtual content they may occasionally shift their focus to a video feed of the broadcast signal. This may create two problems. First, a person may appear to program viewers as unfocused because his or her gaze is directed slightly off from the camera shooting the program. Second, the person might not easily interact with the virtual elements, or move through or around a group of virtual elements (whether such virtual elements are static or dynamic). A person who appears somewhat disconnected from the virtual content may undermine the immersive effect of the show. Also note that interactions may be laborious from a production standpoint (requiring several re-takes and re-shoots when the person looks away from the camera, misses a line due to interacting incorrectly with virtual elements, etc.).
To improve interactions with virtual content,
According to some embodiments, the video of the person 620 (captured by the broadcast camera 640) may be altered to refine his or her pose (and/or possibly appearance) before mixing it with the virtual environment. This may be done by determining the person's three dimensional model, including obtaining a three dimensional surface and skeleton representation (for example, based on an analysis of videos from multiple views), so that the image of the person 620 at a certain location and pose in the scene may be altered in relation to the virtual object. According to some embodiments, the person 620 may be equipped with an HMVD 660 (e.g., three dimensional glasses, virtual retinal displays, etc.) through which he or she can view the virtual environment, including the virtual object 630, from his or her perspective. That is, the virtual object 630 may be displayed to the person from his or her own perspective in a way that enhances the person's ability to navigate through the virtual world and to interact with the content without overly complicating the production workflow.
According to some embodiments, a 3D model of the person 620 (including his or her location, pose, surface, and texture and color characteristics) may be obtained through an analysis of video from the broadcast camera 640 and potentially from auxiliary cameras and/or sensors 642 (attached to the person or external to the person). Once a 3D model of the person 620 is obtained, the image of the person 620 may be reconstructed into a new image that shows the person with a new pose and/or appearance relative to the virtual elements. According to some embodiments, this "viewer video" may be served to the audience. In addition, according to some embodiments, the person's location and pose may be submitted to a three dimensional graphic engine 680 to render a supplemental virtual environment view (e.g., a second virtual camera view) from the perspective of the person 620. This second virtual camera view is presented to the person 620 through the HMVD 660 or any other display device. In one embodiment, the second virtual camera view may be presented in a semi-transparent manner, so the person 620 can still see his or her surrounding real-world environment (e.g., studio, cameras, another person 622, etc.). In yet another embodiment, the second camera view might be presented to the person on a separate screen at the studio. Such an approach may eliminate the need to wear a visible display such as the HMVD 660, but would require the person 620 to look at a monitor instead of directly at the virtual object 630.
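As a minimal sketch of the semi-transparent presentation mentioned above, the rendered second virtual camera view might be alpha-blended with a pass-through view of the real studio before being shown on the HMVD 660; the blend factor used here is an assumption.

```python
import numpy as np

def semi_transparent_overlay(virtual_view, passthrough_view, alpha=0.5):
    """Blend the rendered virtual view with a pass-through view of the real studio."""
    blend = alpha * virtual_view.astype(np.float32) + (1.0 - alpha) * passthrough_view.astype(np.float32)
    return blend.astype(np.uint8)

virtual = np.zeros((540, 960, 3), dtype=np.uint8)       # placeholder rendered view
studio = np.full((540, 960, 3), 128, dtype=np.uint8)    # placeholder pass-through view
hmvd_frame = semi_transparent_overlay(virtual, studio)
```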
To improve the interaction among real and virtual objects, some embodiments use computer-vision techniques to recognize and track the people 620, 622 in the scene 610. A three dimensional model of an anchor may be estimated and used to reconstruct his or her image at different orientations and poses relative to the virtual players or objects 630, 632. The anchor (or any object relative to the anchor) may be reconstructed, according to some embodiments, at different relative sizes, locations, or appearances. Three dimensional reconstruction of objects may be done, for example, based on an analysis of video sequences from which three dimensional information of static and dynamic objects was extracted.
With two or more camera views, according to some embodiments, an object's or person's structure and characteristics might be modeled. For example, based on stereoscopic matching of corresponding pixels from two or more views of a physical object, the cameras' parameters (pose) may be estimated. Note that knowledge of the cameras' poses may in turn provide, for each of the object's pixels, the corresponding real-world coordinates. As a result, when fusing the image of a physical object with virtual content (e.g., computer-generated imagery), the physical object's position in real-world coordinates may be considered relative to the virtual content to resolve problems of overlap and order (e.g., occlusion issues).
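For illustration, the classic two-view relationship, depth = focal length x baseline / disparity, is one way matched pixels from two calibrated views yield real-world distances; the numeric values below are assumed examples.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Two-view relationship: depth = focal length * baseline / disparity."""
    disparity_px = np.maximum(disparity_px, 1e-6)    # guard against division by zero
    return focal_px * baseline_m / disparity_px

# A 20-pixel disparity with a 1200-pixel focal length and a 0.5 m baseline gives ~30 m:
z = depth_from_disparity(np.array([20.0]), focal_px=1200.0, baseline_m=0.5)
```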
According to some embodiments, an order among physical and graphical elements may be facilitated using a depth map. A depth map of a video image may provide the distance between a point in the scene 610 (projected in the image) and the camera 640. Hence, a depth map may be used to determine what part of the image of a physical element should be rendered into the computer generated image, for example, and what part is occluded by a virtual element (and therefore should not be rendered). According to some embodiments, this information may be encoded in a binary occlusion mask. For example, a mask pixel set to "1" might indicate that a physical element's image pixel should be keyed-in (i.e., rendered) while "0" indicates that it should not be keyed-in. A depth map may be generated, according to some embodiments, either by processing the video sequences of multiple views of the scene or by a three dimensional camera such as a Light Detection And Ranging ("LIDAR") camera. A LIDAR camera may be associated with an optical remote sensing technology that measures the distance to, or other properties of, a target by illuminating the target with light (e.g., using laser pulses). A LIDAR camera may use ultraviolet, visible, or near infrared light to locate and image objects based on the reflected time of flight. This information may then be used in connection with any of the embodiments described herein. Other technologies utilizing RF, infrared, and Ultra-wideband signals may be used to measure relative distances of objects in the scene. Note that a similar effect might be achieved using sound waves to determine an anchorperson's location.
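A minimal sketch of the binary occlusion mask described above, assuming per-pixel depth maps are available for both the physical scene and the rendered virtual elements:

```python
import numpy as np

def occlusion_mask(physical_depth, virtual_depth):
    """1 where the physical element is closer to the camera (key it in), 0 otherwise."""
    return (physical_depth < virtual_depth).astype(np.uint8)

physical = np.array([[2.0, 6.0], [3.0, 9.0]])   # meters from the camera (assumed values)
virtual = np.array([[5.0, 5.0], [5.0, 5.0]])
mask = occlusion_mask(physical, virtual)         # [[1, 0], [1, 0]]
```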
Note that a covered scene might include one or more persons (physical objects) 620, 622 that perform relative to virtual objects or elements 630, 632. The person 620 may have a general idea as to the whereabouts and motion of these virtual elements 630, 632, although he or she cannot “see” them in real life. Capturing the scene may be a broadcast (main) camera 640. A control-system 660 may drive the broadcast camera 640 automatically or via an operator. In addition to the main camera 640, there may be any number of additional cameras and/or sensors 642 that are positioned at the scene 610 to capture video or any other telemetry (e.g. RF, UWB, audio, etc.) measuring the appearance and structure of the scene 610.
The control-system 660, operated either automatically or by an operator, may manage the production process. For instance, a game (e.g., a sequence of computer-generated imagery data) including a court/field with playing athletes (e.g., the virtual objects 630, 632) may be selected from a CGI database 670. A camera perspective may then be determined and submitted to the broadcast camera 640 as well as to a three dimensional graphic engine 680. The graphic engine 680 may, according to some embodiments, receive the broadcast camera's model directly from camera-mounted sensors. According to other embodiments, vision-based methods may be utilized to estimate the broadcast camera's model. The three dimensional graphic engine 680 may render the game from the same camera perspective as the broadcast camera 640. Next, the virtual render of the game may be fused with video capture of the people 620, 622 in the scene 610 to show all of the elements in an immersive fashion. According to some embodiments, a "person" may be able to know where he or she is in relation to the virtual object without having to disengage from the scene itself. In the case of a "horror" movie, for instance, the person may never have to see the virtual object in order to react logically to its eerie presence, the totality of which is being transmitted to the viewing audience. Note that this interaction may be conveyed to the audience, such as by merging the virtual and physical into the "viewer video."
Based on analyses of the video, data, and/or audio streams and the telemetry signals fed to the video processor unit 650, various information details may be derived, such as: (i) the image foreground region of the physical elements (or persons), (ii) three dimensional modeling and characteristics, and/or (iii) real world locations. Relative to the virtual elements' presence in the scene 610, as defined by the game (e.g., in the CGI database 670), the physical elements may be reconstructed. Moreover, the pose and appearance of each physical element may be reconstructed, resulting in a new video or rendition (replacing the main camera 640 video) in which the new pose and appearance are in a more appropriate relation to the virtual objects 630, 632. The video processor 650 may also generate an occlusion mask that, together with the video, may be fed into a mixer 690, where fusion of the video and the computer-generated-imagery takes place.
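A minimal sketch of the fusion step performed by the mixer 690, assuming the occlusion mask convention described earlier (1 means key in the physical element's pixel, 0 means show the computer-generated imagery):

```python
import numpy as np

def mix(real_frame, cgi_frame, mask):
    """Key the physical elements into the computer-generated imagery where mask == 1."""
    keyed = mask[..., None].astype(bool)     # expand an HxW mask for HxWx3 color frames
    return np.where(keyed, real_frame, cgi_frame)
```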
In some embodiments, the person 620 interacting with one or more virtual objects 630, 632, uses, for example, an HMVD 660 to “see” or perceive these virtual elements from his or her vantage point. Moreover, the person 620 (or other persons) may be able, using gestures or voice, to affect the virtual object's motion. In some embodiments, a head-mounted device for tracking the person's gaze may be used as means for interaction.
The video processor 650, where the person's location and pose in real world coordinates are computed, may send the calculated person's perspective, or an altered person's perspective, to the three dimensional graphic engine 680. The graphic engine 680, in turn, may render the virtual elements and/or environment from the received person's perspective (vantage point) and send this computer generated imagery, wirelessly, to the person's HMVD 660. The person's gestures and/or voice may be measured by the plurality of cameras and sensors 642 and may be recognized by the video processor 650 and interpreted as commands. These commands may be used to alter the progression and appearance of the virtual play (e.g., pause, slow-down, replay, any special effect, etc.).
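A minimal sketch of how such recognized commands might alter the progression of the virtual play is shown below; the command names and state fields are illustrative assumptions rather than elements of any particular embodiment.

```python
class PlaybackState:
    """Tracks the progression of the virtual play; commands alter speed and position."""

    def __init__(self):
        self.speed = 1.0
        self.time = 0.0

    def apply_command(self, command):
        if command == "pause":
            self.speed = 0.0
        elif command == "slow-down":
            self.speed = max(0.25, self.speed * 0.5)
        elif command == "replay":
            self.time = 0.0
            self.speed = 1.0

    def advance(self, dt):
        self.time += self.speed * dt
        return self.time

play = PlaybackState()
play.apply_command("slow-down")   # e.g., triggered by a recognized voice command
play.advance(1.0 / 30.0)          # advance by one frame interval at the reduced speed
```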
Thus, based on the person's location and movements, a version of the virtual image may be transmitted separately to the person 620. Note that this version can be simple, such as CAD-like drawings or an audio "beep," or more complex, such as the entire look and feel of the virtual image used for the broadcast program. As the person 620 moves, or as the virtual objects 630, 632 move around the person 620, the person 620 may see the virtual image from his or her own perspective, and the virtual image presented as part of the programming may change. The person 620 may interact with the virtual objects 630, 632 (to make them appear, disappear, move, change, multiply, shrink, grow, etc.) through gestures, voice, or any other means. This image may then be transmitted to the person's "normal" eyeglasses (or contact lenses) through which the image is beamed to the person's retina (e.g., a virtual retinal display) or projected on the eyeglasses' lenses. A similar effect could be obtained using hearing devices (e.g., where a certain sound is transmitted as the person interacts with the virtual object).
Note that some embodiments may be applied to facilitate interaction between two or more persons captured by two or more different cameras from different locations (and, possibly, different times). For example, an interviewer at the studio, with the help of an HMVD may “see” the video reconstruction of an interviewee. This video reconstruction may be from the perspective of the interviewer. Similarly, the interviewee may be able to “see” the interviewer from his or her perspective. Such a capability may facilitate a more realistic interaction between the two people.
The processor 710 is also in communication with an input device 740. The input device 740 may comprise, for example, a keyboard, a mouse, computer media reader, or even a system such as that described by this invention. Such an input device 740 may be used, for example, to enter information about a virtual object, a background, or remote and/or studio camera set-ups. The processor 710 is also in communication with an output device 750. The output device 750 may comprise, for example, a display screen or printer or audio speaker. Such an output device 750 may be used, for example, to provide information about a camera set-up to an operator.
The processor 710 is also in communication with a storage device 730. The storage device 730 may comprise any appropriate information storage device, including combinations of magnetic storage devices (e.g., hard disk drives), optical storage devices, and/or semiconductor memory devices such as Random Access Memory (RAM) devices and Read Only Memory (ROM) devices.
The storage device 730 stores a graphics platform application 735 for controlling the processor 710. The processor 710 performs instructions of the application 735, and thereby operates in accordance with any of the embodiments of the present invention described herein. For example, the processor 710 may receive a scene signal, whether or not including an image of a person, from a video camera. The processor 710 may insert a virtual object into the scene signal to create a viewer signal, such that the viewer signal includes a view of the virtual object as it would be seen from the video camera's perspective. The processor 710 may also create a supplemental signal, such that the supplemental signal includes information related to the view of the virtual object as it would be seen from the person's perspective.
As used herein, information may be “received” by or “transmitted” to, for example: (i) the graphics platform 700 from other devices; or (ii) a software application or module within graphics platform 700 from another software application, module, or any other source.
As shown in
Thus, embodiments described herein may use three dimensional information to adjust and/or tune the rendering of a person or object in a scene, and thereby simplify preparation of a program segment. It may let the person focus on the content delivery, knowing that his or her performance may be refined by reconstructing his or her pose, location, and, according to some embodiments, appearance relative to the virtual environment. Moreover, the person may receive a different image and/or perspective from what is provided to the viewer. This may significantly improve the person's ability to interact with the virtual content (e.g., reducing the learning curve for the person and allowing production to happen with fewer takes). In addition, to change or move a virtual element, a person may avoid interacting with an actual monitor (like a touch screen) or pre-coordinate his or her movements so as to appear as if an interaction is happening. That is, according to some embodiments described herein, a person's movements (or spoken words, etc.) can cause the virtual images to change, move, etc. Further, embodiments may reduce the use of bulky monitors, which may free up studio space and increase the portability of the operation (freeing a person to work in a variety of studio environments, including indoor and outdoor environments).
The following illustrates various additional embodiments of the invention. These do not constitute a definition of all possible embodiments, and those skilled in the art will understand that the present invention is applicable to many other embodiments. Further, although the following embodiments are briefly described for clarity, those skilled in the art will understand how to make any changes, if necessary, to the above-described apparatus and methods to accommodate these and other embodiments and applications.
Although three dimensional effects have been described in some of the examples presented herein, note that other effects might be incorporated in addition to (or instead of) three dimensional effects in accordance with the present invention. Moreover, although specific hardware and data configurations have been described herein, note that any number of other configurations may be provided in accordance with embodiments of the present invention (e.g., some of the information associated with the databases and engines described herein may be split, combined, and/or handled by external systems). Further note that embodiments may be associated with any number of different types of broadcast programs (e.g., sports, news, and weather programs).
The present invention has been described in terms of several embodiments solely for the purpose of illustration. Persons skilled in the art will recognize from this description that the invention is not limited to the embodiments described, but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.
This patent application claims the benefit of U.S. Provisional Patent Application No. 61/440,675, entitled "Interaction with Content Through Human Computer Interface" and filed on Feb. 8, 2011. The entire contents of that application are hereby incorporated by reference.