The present invention relates generally to real-time communication streams, e.g., chat or teleconferencing sessions that typically include video but are not required to do so, and more specifically to mining of multimodal data in the communication streams for use in altering at least one characteristic of the stream. The altered stream can present (audibly and/or visually) new content that is related to at least some of the mined data.
Manipulation of video data is often employed in producing commercial films, but is becoming increasingly important in other applications, including video streams available via the Internet, for example chat sessions that can include video. One form of video manipulation is so-called green screen substitution, which motion picture and television producers use to create composite image special effects. For example, actors or other objects may be filmed in the foreground of a scene that includes a uniformly lit flat screen background having a pure color, typically green. A camera using conventional color film, or an electronic camera with a sensor array of red, green, blue (RGB) pixels, captures the entire scene. During production, the green background is eliminated based upon its luminance, chroma, and hue characteristics, and a new backdrop is substituted, perhaps a blue sky with wind-blown white clouds, a herd of charging elephants, etc. If the background image to be eliminated (the green screen) is completely known to the camera, the result is a motion picture (or still picture) of the actors in the foreground superimposed almost seamlessly in front of the substitute background. When done properly, the foreground images appear to superimpose over the substitute background with good granularity at the interface between the edges of the foreground actors or objects and the substitute background. By good granularity it is meant that the foreground actors or objects appear to meld into the substitute background as though they had originally been filmed in front of it. Successful green screen techniques require that the green background appear static to the camera, e.g., that there be no discernible pattern on the green background, such that any movement of the background relative to the camera would go undetected. For backgrounds that do have a motion-discernible pattern, the relationship between camera and background must itself be static. If this static relationship between camera and background is not met, undesired results can occur, such as portions of the foreground being incorrectly identified as background or vice versa.
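By way of illustration only, the following sketch (Python with numpy) shows the essence of such background substitution: pixels classified as green-screen background are replaced with corresponding pixels from a substitute backdrop. The green-dominance test and threshold are illustrative assumptions, not the method of any particular production keyer, which would also weigh luminance, chroma, and hue.

```python
import numpy as np

def chroma_key_composite(frame_rgb, new_background_rgb, green_margin=40):
    """Replace green-screen pixels in frame_rgb with new_background_rgb.

    frame_rgb, new_background_rgb: uint8 arrays of shape (H, W, 3).
    A pixel is treated as background when its green channel dominates
    red and blue by at least green_margin (an illustrative threshold).
    """
    r = frame_rgb[:, :, 0].astype(np.int16)
    g = frame_rgb[:, :, 1].astype(np.int16)
    b = frame_rgb[:, :, 2].astype(np.int16)

    # Simple green-dominance test; real keyers use luminance, chroma and hue.
    is_background = (g - np.maximum(r, b)) > green_margin

    composite = frame_rgb.copy()
    composite[is_background] = new_background_rgb[is_background]
    return composite
```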
Green screen composite imaging is readily implemented in a large commercial production studio, but can be costly and require a large staging facility, in addition to special processing equipment. In practice such imaging effects are typically beyond the reach of amateur video producers and still photographers.
It is also known in the art to acquire images using three-dimensional cameras that ascertain Z depth distances to a target object. Camera systems that acquire both RGB images and Z-data are frequently referred to as RGB-Z systems. With respect to systems that acquire Z-data, e.g., depth or distance information from the camera system to an object, some prior art depth camera systems approximate the distance or range to an object based upon luminosity or brightness information reflected by the object. But Z-systems that rely upon luminosity data can be confused: reflected light from a distant but shiny object and light from a less distant but less reflective object can cause both objects to erroneously appear to be the same distance from the camera. So-called structured light systems, e.g., stereographic cameras, may also be used to acquire Z-data, but in practice such geometry-based methods require high precision and are often fooled.
A more accurate class of range or Z distance systems are the so-called time-of-flight (TOF) systems, many of which have been pioneered by Canesta, Inc., assignee herein. Various aspects of TOF imaging systems are described in the following patents assigned to Canesta, Inc.: U.S. Pat. No. 7,203,356 "Subject Segmentation and Tracking Using 3D Sensing Technology for Video Compression in Multimedia Applications", U.S. Pat. No. 6,906,793 "Methods and Devices for Charge Management for Three-Dimensional Sensing", U.S. Pat. No. 6,580,496 "Systems for CMOS-Compatible Three-Dimensional Image Sensing Using Quantum Efficiency Modulation", and U.S. Pat. No. 6,515,740 "Methods for CMOS-Compatible Three-Dimensional Image Sensing Using Quantum Efficiency Modulation".
Under control of microprocessor 160, a source of optical energy 120, typically of IR or NIR wavelengths, is periodically energized and emits optical energy S1 via lens 125 toward a target object 20. Typically the optical energy is light, for example emitted by a laser diode or LED device 120. Some of the emitted optical energy will be reflected off the surface of target object 20 as reflected energy S2. This reflected energy passes through an aperture field stop and lens, collectively 135, and falls upon a two-dimensional array 130 of pixel detectors 140, where a depth or Z image is formed. In some implementations, each imaging pixel detector 140 captures the time-of-flight (TOF) required for optical energy transmitted by emitter 120 to reach target object 20 and be reflected back for detection by two-dimensional sensor array 130. Using this TOF information, distances Z can be determined and output elsewhere as part of the DATA signal, as needed.
Emitted optical energy S1 traversing to more distant surface regions of target object 20, e.g., at distance Z3, before being reflected back toward system 100 will define a longer time-of-flight than radiation falling upon and being reflected from a nearer surface portion of the target object (or a closer target object), e.g., at distance Z1. For example, the time-of-flight for optical energy to traverse the roundtrip path noted as t1 is given by t1=2·Z1/C, where C is the velocity of light. TOF sensor system 100 can acquire three-dimensional images of a target object in real time, simultaneously acquiring both luminosity data (e.g., signal brightness amplitude) and true TOF distance (Z) measurements of a target object or scene. Most of the Z-pixel detectors in Canesta-type TOF systems have additive signal properties in that each individual pixel acquires a pair of data (i.e., a vector) in the form of luminosity information and Z distance information.
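The roundtrip relation t1=2·Z1/C can be inverted to recover range from a measured time-of-flight. A minimal sketch follows (Python); the example roundtrip time is an illustrative value only.

```python
C = 3.0e8  # speed of light, m/s

def distance_from_tof(roundtrip_time_s):
    """Recover range Z from a measured roundtrip time t = 2*Z/C."""
    return C * roundtrip_time_s / 2.0

# Example: a roundtrip of about 6.67 ns corresponds to a target ~1 m away.
print(distance_from_tof(6.67e-9))  # ~1.0 m
```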
Many Canesta, Inc. systems determine TOF and construct a depth image by examining the relative phase shift between the transmitted light signals S1, having a known phase, and signals S2 reflected from the target object. Exemplary such phase-type TOF systems are described in several U.S. patents assigned to Canesta, Inc., assignee herein, including U.S. Pat. No. 6,515,740 "Methods for CMOS-Compatible Three-Dimensional Image Sensing Using Quantum Efficiency Modulation", U.S. Pat. No. 6,906,793 "Methods and Devices for Charge Management for Three-Dimensional Sensing", U.S. Pat. No. 6,678,039 "Method and System to Enhance Dynamic Range Conversion Useable With CMOS Three-Dimensional Imaging", U.S. Pat. No. 6,587,186 "CMOS-Compatible Three-Dimensional Image Sensing Using Reduced Peak Energy", and U.S. Pat. No. 6,580,496 "Systems for CMOS-Compatible Three-Dimensional Image Sensing Using Quantum Efficiency Modulation".
Some of the emitted optical energy (denoted Sout) will be reflected (denoted S2=Sin) off the surface of target object 20, will pass through the aperture field stop and lens, collectively 135, and will fall upon two-dimensional array 130 of pixel photodetectors 140. When reflected optical energy Sin impinges upon photodetectors 140 in array 130, photon energy is converted within the photodetectors into tiny amounts of detection current. For ease of explanation, outgoing and incoming optical energy may be modeled as Sout=cos(ω·t) and Sin=A·cos(ω·t+θ), respectively, where A is a brightness or intensity coefficient, ω is the angular modulation frequency, and θ is the phase shift. As distance Z changes, phase shift θ changes, and it is this phase shift that is detected and processed to recover Z.
In preferred embodiments, pixel detection information is captured at at least two discrete phases, preferably 0° and 90°, and is processed to yield Z data.
System 100 yields a phase shift θ at distance Z due to time-of-flight given by:
θ=2·ω·Z/C=2·(2·π·f)·Z/C (1)
where C is the speed of light, 300,000 km/sec. From equation (1) above it follows that distance Z is given by:
Z=θ·C/(2·ω)=θ·C/(2·2·π·f) (2)
And when θ=2·π, the aliasing interval range associated with modulation frequency f is given as:
Z_AIR=C/(2·f) (3)
In practice, changes in Z produce changes in phase shift θ, although eventually the phase shift begins to repeat, e.g., θ and θ+2·π are indistinguishable. Thus, distance Z is known modulo 2·π·C/(2·ω)=C/(2·f), where f is the modulation frequency.
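The following sketch illustrates, under an idealized noise-free model, how phase samples taken at 0° and 90° might be combined per equations (1)-(3) to recover θ, Z, and the aliasing interval range. The 44 MHz modulation frequency and the assumption that the two samples are proportional to A·cos θ and A·sin θ are for illustration only.

```python
import math

C = 3.0e8  # speed of light, m/s

def phase_to_distance(d0, d90, mod_freq_hz):
    """Recover Z from idealized two-phase TOF samples.

    d0, d90: detector outputs with the demodulation reference at 0 and 90
    degrees; in this noise-free model d0 ~ A*cos(theta), d90 ~ A*sin(theta).
    Z follows from theta = 2*omega*Z/C, i.e. Z = theta*C/(2*omega).
    """
    theta = math.atan2(d90, d0) % (2.0 * math.pi)   # phase shift in [0, 2*pi)
    omega = 2.0 * math.pi * mod_freq_hz
    return theta * C / (2.0 * omega)

def aliasing_interval(mod_freq_hz):
    """Unambiguous range Z_AIR = C/(2*f); Z repeats modulo this distance."""
    return C / (2.0 * mod_freq_hz)

# Example with an illustrative 44 MHz modulation frequency:
f = 44.0e6
A, Z_true = 1.0, 1.5                                  # metres
theta = 2.0 * (2.0 * math.pi * f) * Z_true / C
print(phase_to_distance(A * math.cos(theta), A * math.sin(theta), f))  # ~1.5 m
print(aliasing_interval(f))                            # ~3.41 m
```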
Canesta, Inc. has also developed a so-called RGB-Z sensor system, a system that simultaneously acquires both red, green, blue visible data, and Z depth data.
However, the present invention can function with many three-dimensional sensor systems whose performance characteristics may be inferior to those of true TOF systems. Some three-dimensional systems use so-called structured light, e.g., the above-cited U.S. Pat. No. 6,710,770, assigned to Canesta. Other prior art systems attempt to emulate three-dimensional imaging using two spaced-apart stereographic cameras. In practice, however, the performance of such stereographic systems is impaired by the fact that the two spaced-apart cameras acquire two images whose data must somehow be correlated to arrive at a three-dimensional image. Further, such systems are dependent upon luminosity data, which can often be misleading, e.g., distant bright objects may appear to be as close to the system as nearer gray objects.
Thus there is a need for real-time video processing systems and techniques that can acquire three-dimensional data and provide intelligent video manipulation. Preferably such a system would examine data including at least one of video, audio, and text, and intelligently manipulate all or some of the data. Preferably such a system should retain foreground video but intelligently replace background video with new content that depends on information mined from the video and/or audio and/or textual information in the stream of communication data. Preferably, such systems and techniques should operate well in the real world, in real time.
The present invention provides such systems and techniques, both in the context of three-dimensional systems that employ relatively inexpensive arrays of RGB and Z pixels, and for other three-dimensional imaging systems as well.
Embodiments of the present invention provide methods and systems to mine or extract data present during interaction between at least two participants, for example in a communications stream, perhaps a chat or a video session, via the Internet or other transmission medium. The present invention analyzes the data and, responsive to the data, can create displayable content for viewing by one or more chat session participants. Without limitation, the data from at least one chat session participant includes a characteristic of that participant, which can include web-camera-generated video, audio, keyboard-typed information, handwriting-recognized information, user-made gestures, etc. The displayable content may be viewed by at least one of the participants, and preferably by all. Thus, while several embodiments of the present invention are described with respect to mining video data, the data mined can be at least one of video, audio, writing (keyboard-entered to hand-generated), and gestures, without limitation. Thus the term video chat session can be understood to include a chat session in which the medium of exchange includes at least one of the above-enumerated data.
In one aspect, the present invention combines a video foreground based upon a participant's generated video, with a customized computer generated background that preferably is based upon data mined from the video chat session. The customized background preferably is melded seamlessly with the participant's foreground data, and can be created even in the absence of a video stream from the participant. Such melding can be carried out using background substitution, preferably by combining video information using both RGB or grayscale video and depth video, acquired using a depth camera. In one aspect, the background video includes targeted content such as an advertisement whose content is related to data mined from at least one of the participants in the chat session.
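A minimal sketch of such depth-based background substitution follows (Python with numpy). The fixed Z threshold is an illustrative stand-in for more robust foreground/background segmentation, and the array shapes are assumptions for illustration.

```python
import numpy as np

def substitute_background(rgb, depth_m, new_background, z_threshold_m=1.2):
    """Keep foreground pixels (closer than z_threshold_m) and replace the rest.

    rgb, new_background: uint8 arrays of shape (H, W, 3).
    depth_m: float array of shape (H, W) of per-pixel Z distances in metres,
    e.g. from an RGB-Z or TOF sensor. The fixed threshold is an illustrative
    stand-in for more robust segmentation.
    """
    foreground_mask = depth_m < z_threshold_m
    out = new_background.copy()
    out[foreground_mask] = rgb[foreground_mask]
    return out
```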
Preferably a participant's foreground video has a transparency level greater than 0%, and is scalable independently of the size of the computer generated background. This computer generated background may include a virtual whiteboard useable by a participant in the video chat, or may include an advertisement with participant-operable displayed buttons. Other computer generated background information may include an HTML page, a video stream, or a database with image(s), including a database with social networking information. Preferably this computer generated background is updatable in real time responsive to at least one content of the video chat, and can provide information about events occurring substantially contemporaneously with the video chat.
Other features and advantages of the invention will appear from the following description in which the preferred embodiments have been set forth in detail, in conjunction with the accompanying drawings.
Aspects of the present invention may be practiced with image acquisition systems that acquire only Z data, and/or RGB data. In embodiments where RGB and Z data are used, the system that acquires RGB data need not be part of the system that detects Z data.
Embodiments of the present invention utilize background substitution, which may be implemented in any number of ways, such that although the background is substituted, important and relevant information in the foreground image is preserved. In various embodiments, the foreground and/or background images may be derived from a real-time video stream, for example a video stream associated with a chat or teleconferencing session in which at least two users communicate via the Internet, a LAN, a WAN, a cellular network, etc. In the example of a telephonic communications session or chat, enunciated sounds and words could be mined. Thus if one participant said "I am hungry", a voice could come into the chat and enunciate "if you wish to order a pizza, dial 123-4567", or perhaps "press 1", etc.
Commercial enterprises such as Google™ mail insert targeted advertisements in an email based on perceived textual content of the email; substantial advertising revenue is earned by Google as a result. Embodiments of the present invention intelligently mine data streams associated with chat sessions and the like, e.g., video data and/or audio data and/or textual data, and then alter the background image seen by participants in the chat session to present targeted advertising. In embodiments of the present invention, the presented advertising is interactive in that a user can click or otherwise respond to the ad to achieve a result, perhaps ordering a pizza in response to a detected verbal, audio, textual, or visual cue (including a recognized gesture) indicating hunger. Other useful data, beyond advertisements, may be inserted into the information data stream responsive to the contents of the information exchanged. Such other useful information may include the results of searches based on the information exchanged, or relevant data pertinent to the exchange.
In one embodiment, system 100″ or 400 includes known textual search infrastructures that can detect audio from a user's system and then employ speech-to-text translation of the audio. The thus-generated text is then coupled into a search engine or into a program similar to the Google™ mail program. Preferably the most relevant fragments of the audio are extracted, so as to reduce the number of queries to the search engine.
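By way of illustration, the pipeline just described might be sketched as follows. Here speech_to_text and search_ads are hypothetical caller-supplied callables standing in for a real recognizer and search/ad back end, and the keyword list is an assumption for illustration.

```python
def select_background_ad(audio_chunk, speech_to_text, search_ads,
                         keywords=("hungry", "pizza")):
    """Illustrative pipeline: audio -> text -> reduced query -> targeted content.

    speech_to_text and search_ads are caller-supplied callables standing in
    for a real recognizer and ad/search back end (hypothetical interfaces).
    """
    text = speech_to_text(audio_chunk).lower()
    # Keep only the most relevant fragments, to reduce queries to the engine.
    relevant = [word for word in text.split() if word.strip(".,!?") in keywords]
    if not relevant:
        return None
    return search_ads(" ".join(relevant))
```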
In some embodiments in which the chat session includes a video stream, a new background may be substituted responsive to information exchanged in the chat session. Such background may contain advertisements, branding, or other topics of interest relevant to the session. The foreground may be scaled (up or down, or even distorted) so as to create adequate space for information to be presented in the background. The background may also be part of a document being exchanged during the chat or teleconferencing session, such as a Microsoft Word™ document or Microsoft Powerpoint™ presentation. Because the foreground contains information that is meaningful to the users, user attention is focused on the foreground. Thus, the background is a good location in which to place information that is intelligently selected from aspects of the chat session data streams. Note that ad information, if appropriate, may also be overlaid over regions of the foreground, preferably over foreground regions deemed relatively unimportant.
The displayed video foreground may be scaled to fit properly in a background. For example a user's bust may be scaled to make the user look appropriate in a background that includes a conference table. In a video stream in which the foreground includes one or more users, user images may be replaced by avatars that can perform responsively to movements of the users they represent, e.g., if user number 1 raises the right hand to get attention, the displayed avatar can do likewise. Alternately the avatars may just be symbols representing a user participant, or more simply, symbols representing the status of the chat session.
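One way such scaling and placement of a segmented foreground into a background might be sketched is shown below (Python with numpy, nearest-neighbour scaling). The scale factor, placement, and the requirement that the scaled foreground fit within the background are illustrative assumptions.

```python
import numpy as np

def place_scaled_foreground(background, foreground, mask, scale, top_left):
    """Nearest-neighbour scale the segmented foreground and paste it into the
    background at top_left.

    background, foreground: uint8 arrays of shape (H, W, 3).
    mask: boolean (H, W) foreground mask, e.g. from depth segmentation.
    scale, top_left: illustrative layout parameters; the scaled region is
    assumed to fit within the background.
    """
    h, w = mask.shape
    new_h, new_w = int(h * scale), int(w * scale)
    rows = (np.arange(new_h) / scale).astype(int)
    cols = (np.arange(new_w) / scale).astype(int)
    fg_scaled = foreground[rows][:, cols]
    mask_scaled = mask[rows][:, cols]

    out = background.copy()
    r0, c0 = top_left
    region = out[r0:r0 + new_h, c0:c0 + new_w]   # view into out
    region[mask_scaled] = fg_scaled[mask_scaled]
    return out
```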
As noted, preferably all modes of communication during the session may be intelligently mined for data. For example in a chat session whose communication stream includes textual chat, intelligent scanning of the textual data stream, the video data stream, and the audio data stream may be undertaken, to derive information. For example, if during the chat session a user types the word “pizza” or says the word “pizza” or perhaps points to an image of a pizza or makes a hunger-type gesture, perhaps rubbing the stomach, the present invention can target at least one, perhaps all user participants with an advertisement for pizza. The system may also keep track of which information came from which participant (e.g. who said what) to further refine its responses.
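A minimal sketch of such per-participant trigger-word mining follows; the event tuple format, trigger words, and ad text are assumptions for illustration.

```python
from collections import defaultdict

PIZZA_AD = "Hungry? Order a pizza: dial 123-4567"   # illustrative content

def mine_session(events, trigger_words=("pizza", "hungry")):
    """Scan per-participant chat events for trigger words and return targeted
    content keyed by participant.

    events: iterable of (participant_id, modality, payload) tuples, where
    payload is text typed, speech recognized as text, or a gesture label;
    this event format is an assumption for illustration.
    """
    hits = defaultdict(list)
    for participant, modality, payload in events:
        tokens = payload.lower().split()
        if any(word in tokens for word in trigger_words):
            hits[participant].append((modality, PIZZA_AD))
    return hits
```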
In one embodiment, the responses themselves may be placed in the text transfer stream, e.g., a pizza ad is placed into the text stream, or is inserted into the audio stream, e.g., an announcer reciting a pizza ad. In some embodiments, the background of the associated video stream is affected by action in the foreground, e.g., a displayed avatar jumps with joy and has a voice bubble spelling out, "I am hungry for pizza". It is understood that a computer controlled graphic output responsive to the chat session may be implemented with or without the presence of a video stream. The computer controlled response is presented to at least one participant in the chat session, and may of course be presented to several if not all participants. It is understood that each participant in the chat session may be presented with a different view of the session.
If desired, the extracted foreground may be overlaid atop the background with some transparency, which may be rendered in a manner known in the art, perhaps akin to rendering in Windows Vista™. So doing allows important aspects of the background to remain visible to the users when the foreground is overlaid. In one embodiment, this overlay is implemented by making the foreground transparent. Alternatively, the foreground may be replaced by computer generated image(s) that preferably are controlled responsive to user foreground movements. Such control can be implemented by acquiring three-dimensional gesture information from the user participant using a three-dimensional sensor system or camera, as described in U.S. Pat. No. 7,340,077 (2008) "Gesture Recognition System Using Depth Perceptive Sensors", assigned to Canesta, Inc., assignee herein. If desired, rather than appearing within its own window, the foreground or computer generated image may be placed directly on a desktop. In such an embodiment, this imagery can be rendered in a fashion akin to Microsoft Word™ help assistants.
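Such a transparent overlay of the extracted foreground might be sketched as a simple alpha blend (Python with numpy); the alpha value and the source of the foreground mask are illustrative assumptions.

```python
import numpy as np

def overlay_with_transparency(background, foreground, mask, alpha=0.7):
    """Blend the segmented foreground over the background so the background
    stays partially visible (alpha < 1 gives the foreground a transparency
    level greater than 0%). alpha is an illustrative value.
    """
    bg = background.astype(np.float32)
    fg = foreground.astype(np.float32)
    out = bg.copy()
    m = mask  # boolean (H, W) foreground mask, e.g. from depth segmentation
    out[m] = alpha * fg[m] + (1.0 - alpha) * bg[m]
    return out.astype(np.uint8)
```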
Modifications and variations may be made to the disclosed embodiments without departing from the subject and spirit of the present invention as defined by the following claims.
Priority is claimed to co-pending U.S. provisional patent application Ser. No. 61/126,005, filed 30 Apr. 2008, entitled "Method and System for Intelligently Mining Data During Video Communications to Present Context-Sensitive Advertisements Using Background Substitution", which application is assigned to Canesta, Inc., assignee herein.