Embodiments of the present invention relate to transmitting video and sharing content via a network, and in particular, to more efficiently transmitting video and content via a network by transmitting them separately using optimized protocols.
Some video transmission systems merge video and content to be shared into one video stream. In such systems, the video stream may be transmitted using standard video codecs and streaming protocols. Upon receipt, the video and content are displayed for viewing in a web browser. These systems require no processing of the video stream at the viewer site aside from the processes related to receiving and displaying. Such systems typically apply the same methods of compression, transmission, reception, and display to the combined video and shared content, even though different methods may be more efficient or otherwise more suitable for each of the components that went into the video stream.
Where transmission systems send video and content separately, the video itself is typically transmitted using processes that treat the pixels of the video uniformly. Thus, such current transmission systems do not exploit the potential provided by user-extracted video to differentiate between an image part and a background part of the user-extracted video, or between an image part and a non-image part of a user-extracted video combined with another video or other content. Also, current video transmission systems do not support the use of an alpha mask (also known as an “alpha channel”), though there have been efforts to modify current systems to support WebM video with an alpha channel for VP8 video.
Embodiments of the claimed subject matter disclose methods and systems related to transmitting user-extracted video and content more efficiently. These embodiments recognize that user-extracted video provides the potential to treat parts of a single frame of the user-extracted video differently, e.g., the image part of the user-extracted video may be encoded to retain a higher quality upon decoding than the remainder of the user-extracted video. Such different treatment of the parts of a user-extracted video may allow more efficient transmission. According to such embodiments, a user-extracted video is created along with an associated alpha-mask, which identifies the image part of the user-extracted video. If the image part is more important than the remainder of the user-extracted video, e.g., if it is a higher priority to have a high-resolution image part, it is processed for transmission using methods that preserve its quality or resolution in comparison to the remainder of the user-extracted video. During this processing the alpha mask is used to differentiate between the image part and the remainder of the user-extracted video. The processed video is then sent to a receiving computer.
In an embodiment, content is also selected and combined with the user-extracted video to create a composite video. During processing, the alpha mask is then used to differentiate between the image part and, in this embodiment, the remainder of the composite video.
In an embodiment, a chroma-key is employed to include the alpha mask in the encoded video. Dechroma-keying is then used to re-generate the alpha mask from the sent and decoded video. The re-generated alpha mask is used to determine an alpha value for each pixel of each frame of the decoded video, where a pixel's alpha value is based on the difference between its color in the decoded video and a key color. The alpha value then determines whether the decoded color of that pixel is displayed.
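For illustration, the dechroma-keying step can be sketched as follows. This is a minimal sketch, not the claimed implementation: it assumes decoded pixels are integer (Y, U, V) tuples, measures the color difference with the L1 norm, and maps that distance to an alpha value with a linear ramp between two hypothetical thresholds, `t_low` and `t_high`, so that small codec errors around the key color still yield alpha 0.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (Y, U, V)


def l1_distance(a: Pixel, b: Pixel) -> int:
    """Sum of absolute per-channel differences (L1 norm)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dechroma_key(frame: List[List[Pixel]], key: Pixel,
                 t_low: int = 8, t_high: int = 24) -> List[List[float]]:
    """Regenerate an alpha mask from a decoded, chroma-keyed frame.

    Pixels within t_low of the key color are background (alpha 0), pixels
    at least t_high away are image part (alpha 1), with a linear ramp in
    between. The thresholds are illustrative assumptions.
    """
    if not 0 <= t_low < t_high:
        raise ValueError("require 0 <= t_low < t_high")
    mask = []
    for row in frame:
        mask_row = []
        for px in row:
            d = l1_distance(px, key)
            if d <= t_low:
                alpha = 0.0
            elif d >= t_high:
                alpha = 1.0
            else:
                alpha = (d - t_low) / (t_high - t_low)
            mask_row.append(alpha)
        mask.append(mask_row)
    return mask
```

A hard 0/1 mask is the special case where the ramp collapses to a single threshold; the ramp is one possible way to soften edges damaged by lossy decoding.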
In an embodiment, control information regarding a dynamic chroma-key is sent. The control information represents a dynamic chroma-key, that is, a key color that is not found within the associated image part of the video. This key color was used to replace the remainder of the associated user-extracted video. Should the image part of the video change such that a pixel color matches the key color, a new key color is chosen to replace the remainder of the associated user-extracted video, and the control information is changed to represent the new key color.
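A minimal sketch of how a sender might track the dynamic key color and emit updated control information. The list of candidate key colors and the control message fields (`key_color`, `version`) are hypothetical. As in the paragraph above, only an exact color match triggers a change; a stricter check would also enforce a safety margin around the key color.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Pixel = Tuple[int, int, int]


class DynamicKeyController:
    """Tracks the current key color and emits a control update on collision.

    Candidate key colors and the message format are illustrative assumptions.
    """

    def __init__(self, candidates: List[Pixel]):
        if not candidates:
            raise ValueError("need at least one candidate key color")
        self.candidates = candidates
        self.key = candidates[0]
        self.version = 0

    def update(self, image_pixels: Iterable[Pixel]) -> Optional[Dict[str, object]]:
        """Return a new control message if the key color now occurs in the image part."""
        used: Set[Pixel] = set(image_pixels)
        if self.key not in used:
            return None  # current key is still unused; no control update needed
        for cand in self.candidates:
            if cand not in used:
                self.key = cand
                self.version += 1
                return {"key_color": self.key, "version": self.version}
        raise RuntimeError("no candidate key color is free of the image part")
```

The version counter lets a receiver ignore stale control messages if they arrive out of order.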
In the following description, numerous details and alternatives are set forth for purposes of explanation. However, one of ordinary skill in the art will realize that embodiments can be practiced without the use of these specific details. In other instances, well-known structures and devices are shown in block diagram form to not obscure the embodiments with unnecessary detail. And although the methods herein may be described in one order, one of skill will realize that the methods may be employed in a number of different orders.
In
Still regarding
First, regarding user-extracted video data 108, chroma-keying processing may be used to embed an alpha mask in the video frame. Such embedding is typically performed in real-time. An alpha mask represents a video frame using 0 or 1 for each pixel of that frame. Where the alpha mask contains a 0, that pixel is part of the background part of the user-extracted video. Where the alpha mask contains a 1, that pixel is part of the image part of the user-extracted video. An alpha mask is created during the extraction of the user from the video, which is discussed within. Video data 108 may then be compressed using a standard encoder or an encoder according to an embodiment (“Z-encoder,” see the discussion of
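As a minimal sketch of this embedding, assuming frames are lists of rows of (Y, U, V) tuples and the alpha mask uses the 0/1 convention above, every background pixel can simply be overwritten with the key color before encoding. The function name is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def embed_alpha_by_chroma_key(frame: List[List[Pixel]],
                              alpha: List[List[int]],
                              key: Pixel) -> List[List[Pixel]]:
    """Replace every background pixel (alpha 0) with the key color.

    Image-part pixels (alpha 1) are left untouched, so the alpha mask is
    carried inside the color frame and no separate alpha channel is sent.
    """
    if len(frame) != len(alpha) or any(len(f) != len(a) for f, a in zip(frame, alpha)):
        raise ValueError("frame and alpha mask must have the same dimensions")
    return [
        [px if a == 1 else key for px, a in zip(frow, arow)]
        for frow, arow in zip(frame, alpha)
    ]
```

Because the background becomes a single flat color, it also tends to compress to very few bits with a standard encoder.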
Second, regarding control information 106, this information is used to synchronize the sharing of content between the host/sender 102 and receiver/viewer 110 displays. For example, should content 104 be a document that was sent ahead of video data 108, control information 106 would need the information necessary to synchronize the page number of the document with video data 108. Control information 106 also contains rendering information (e.g., the relative position, size, and degree of transparency of the user-extracted video 108) for rendering that video with the shared content 104.
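The control information could be represented, for example, as a small serializable record. The field names and the JSON encoding below are assumptions for illustration only; no wire format is specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RenderingInfo:
    x: float             # relative horizontal position of the video (0..1 of content width)
    y: float             # relative vertical position (0..1 of content height)
    scale: float         # video size relative to the content
    transparency: float  # 0 = opaque, 1 = fully transparent


@dataclass
class ControlInfo:
    content_id: str           # identifies the shared content item
    page: int                 # document page to show with the current video
    frame_timestamp_ms: int   # video timestamp the page number is synchronized to
    rendering: RenderingInfo

    def to_json(self) -> str:
        # asdict() recursively converts the nested RenderingInfo as well.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "ControlInfo":
        d = json.loads(text)
        d["rendering"] = RenderingInfo(**d["rendering"])
        return ControlInfo(**d)
```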
Third, regarding content 104, such content may include, for example, documents, photos, presentation slides, video clips, and web pages, which may be uploaded from a user computer or from shared cloud services such as Google Docs™, Microsoft Office 365™, YouTube™, Vimeo™, and SlideShare™. By splitting the data and handling different video streams with codecs and protocols that are matched to, or optimized for, the specific streaming data (e.g., still image or video), various system embodiments help to minimize transmission bit rate requirements while retaining visual quality. Codecs and protocols may, for example, be optimized to improve the resolution and frame rate of video 108, since video typically contains movement, while codecs and protocols for content 104 may be optimized to improve content details. In embodiments, “smart” strategies are employed that automatically choose different protocols based on the type of data (e.g., video, document, etc.) being transmitted.
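A minimal sketch of such a type-based strategy, assuming MIME types identify the data and using two hypothetical transport profiles. No specific codecs or protocols are named for each type, so the profiles only record which qualities to favor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportProfile:
    name: str
    favor_frame_rate: bool   # motion-heavy data: keep frame rate and temporal resolution
    favor_detail: bool       # mostly static data: keep fine spatial detail such as text
    send_changes_only: bool  # transmit only regions that changed since the last update


# Hypothetical profiles for illustration.
VIDEO_PROFILE = TransportProfile("video", True, False, False)
STILL_PROFILE = TransportProfile("still", False, True, True)


def choose_profile(mime_type: str) -> TransportProfile:
    """Map the type of data being shared to a transport profile.

    Video clips are treated as motion-heavy; documents, photos, slides,
    and web pages are treated as mostly static.
    """
    return VIDEO_PROFILE if mime_type.lower().startswith("video/") else STILL_PROFILE
```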
In some embodiments, the sender processing flow may be as follows. First, a user persona is extracted from a video (see
Still regarding
In additional embodiments, the method 300 may further include the following. Content may be selected to accompany the user-extracted video. This content may be combined with the user-extracted video to create a composite video. In such a case, at 306, the priority of the image part would be determined in relation to the remainder of the composite video, at 308 the alpha mask would be used to encode the image part and the remainder of the composite video differently based in part on the priority of the image part, and at 310 the encoded composite video would be sent to the at least one receiving computer.
Regarding step 408, in some embodiments, the background part is not displayed at the receiver. It would therefore be inefficient to compress and transmit the whole video frame only to discard the background part at the receiver. Embodiments disclosed herein mitigate this inefficiency by embedding alpha mask information in the color frame and then executing a chroma-keying technique to separate the video frame into an image part and a background part. In such embodiments, the background part may be encoded or transmitted differently (including, for example, not being encoded or transmitted at all). Such is the case, for example, with conferencing applications where only the user's image (and not the user's surrounding environment) is to be combined or shared for embedding with virtual content. This treatment of the background part saves bandwidth by not transmitting unnecessary pixel data.
The choice of key color preferably satisfies the following requirements: 1) no pixel in the foreground area has the same color as the key color; 2) there is some safe L1 norm distance between the key color and the closest foreground pixel color; and 3) the key color does not require frequent change and is chosen to minimize the size of the encoded video packets. The safe L1 norm distance is chosen based on considerations such as data type, compression methods, and decoding methods.
Regarding requirement 2), the safe L1 distance is needed because, after the video is encoded and sent through the network (e.g., the Internet), color values may not be preserved exactly when the video is decompressed and decoded for display. Rather, the decoder may output color values that are similar to, but not the same as, the original uncompressed ones. Thus, a safe L1 norm distance ensures that the decoded key color values of the background part remain separated from the decoded color values of the image part (or foreground area) of the user-extracted video.
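Requirements 1) and 2) can be checked together, as in the minimal sketch below. The value of `safe_distance` is an assumption; per the text above it would be chosen based on the data type and the compression and decoding methods in use.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]


def l1_distance(a: Pixel, b: Pixel) -> int:
    """Sum of absolute per-channel differences (L1 norm)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def key_color_is_safe(key: Pixel, foreground: Iterable[Pixel], safe_distance: int) -> bool:
    """Check that every foreground color is at least safe_distance away from the key.

    With safe_distance > 0 this also rules out an exact match (requirement 1).
    """
    if safe_distance <= 0:
        raise ValueError("safe_distance must be positive")
    return all(l1_distance(key, px) >= safe_distance for px in foreground)
```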
Almost all codecs, such as VP8 or H.264, prefer input video in the YUV color space for ease and efficiency of video compression. Thus, for the static chroma-key technique, the video is converted from RGB to YUV color space, and most digital implementations apply a fixed-point approximation for this conversion.
Since the value range of the output YUV is normally scaled to [16, 235], it is possible to use the value {0, 0, 0} as the key color: no converted pixel can take this value, so this selection satisfies requirements 1-3 above. However, not all codec implementations limit the YUV range to [16, 235]. For such implementations, an embodiment uses a dynamic chroma-key technique.
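To illustrate why {0, 0, 0} is a safe static key, the sketch below uses the widely used 8-bit BT.601 limited-range fixed-point approximation. These coefficients are an assumption, since the specific approximation is not reproduced in this text. Under it, Y spans [16, 235] and U and V span [16, 240], so no converted pixel can equal {0, 0, 0}.

```python
from itertools import product
from typing import Tuple


def rgb_to_yuv_fixed(r: int, g: int, b: int) -> Tuple[int, int, int]:
    """8-bit RGB to BT.601 limited-range YUV using integer arithmetic only."""
    y = ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16
    u = ((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128
    v = ((112 * r - 94 * g - 18 * b + 128) >> 8) + 128
    return y, u, v


def output_range() -> Tuple[int, int]:
    """Minimum and maximum over all channels and all 8-bit RGB inputs.

    Each channel is the floor of an affine function of (r, g, b), so its
    extremes occur at the eight corners of the RGB cube.
    """
    values = [c for rgb in product((0, 255), repeat=3) for c in rgb_to_yuv_fixed(*rgb)]
    return min(values), max(values)
```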
Still regarding
At 606, should no empty box of the chosen dimensions be found, the key color {y_k, u_k, v_k} is chosen to minimize the expression:
$$E(y_k, u_k, v_k) = \sum_{y \in \Omega_y} \sum_{u \in \Omega_u} \sum_{v \in \Omega_v} w[y, u, v] \, H[y, u, v]$$
Where:
w[y, u, v] is the weight of each bin, depending on its distance from the center of the box;
Ω_y, Ω_u, Ω_v are the neighborhood areas in the y, u, and v axes, respectively; and
H[y, u, v] is the bin value of color {y, u, v}.
If, at 606, E > 0, meaning that at least one pixel value in the image part/foreground area has its color inside the center box or one of its neighboring boxes, then that pixel color value is modified so that it no longer lies inside the box. This avoids ambiguity in the dechroma-keying step.
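A minimal sketch of the dynamic key search, assuming a coarse 3-D histogram of the foreground YUV colors, a hypothetical `bin_size` and neighborhood `radius` (standing in for Ω_y, Ω_u, Ω_v), and a weight that halves with each step of L1 distance from the center bin. The weight function is an assumption; the text only says that it depends on distance from the center. An empty box gives E = 0, so the empty-box search and the minimization of E share one loop. The follow-up step of moving foreground pixels out of the box when E > 0 is not shown.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (Y, U, V), each 0..255
Bin = Tuple[int, int, int]


def choose_dynamic_key(foreground: Iterable[Pixel], bin_size: int = 32,
                       radius: int = 1) -> Tuple[Pixel, float]:
    """Pick a key color from a coarse YUV histogram of the image part.

    Returns (key_color, E). E == 0 means the chosen box and its neighbors
    hold no foreground pixel; E > 0 means the least-populated box was used.
    """
    if bin_size <= 0 or 256 % bin_size:
        raise ValueError("bin_size must be a positive divisor of 256")
    n = 256 // bin_size  # bins per axis
    # H[y, u, v]: number of foreground pixels falling into each bin.
    hist = Counter((y // bin_size, u // bin_size, v // bin_size) for y, u, v in foreground)
    # Offsets spanning the neighborhood around a center bin.
    offsets = list(product(range(-radius, radius + 1), repeat=3))

    def weight(off: Bin) -> float:
        # Bins closer to the center count more; powers of 1/2 keep sums exact.
        return 0.5 ** sum(abs(o) for o in off)

    best_center: Bin = (0, 0, 0)
    best_e = float("inf")
    for center in product(range(n), repeat=3):
        e = 0.0
        for off in offsets:
            b = tuple(c + o for c, o in zip(center, off))
            if all(0 <= i < n for i in b):
                e += weight(off) * hist[b]  # Counter returns 0 for empty bins
        if e < best_e:
            best_center, best_e = center, e
            if e == 0:
                break  # an empty box was found; stop searching
    # Use the color at the middle of the chosen center bin as the key.
    key = tuple(c * bin_size + bin_size // 2 for c in best_center)
    return key, best_e
```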
Compared to the static chroma-key method, the dynamic chroma-key method requires more computation and bandwidth. Therefore, it is preferable to use the dynamic method only when the static method cannot be applied.
Regarding quantization block 950, at 970 an alpha mask 910 from the user-extraction process may be used to drive the quality of a quantization block 950 so that macro blocks in the user-image or user-extracted region of a video frame 920, i.e., the more important sections, are quantized with more bits than the background. Alpha mask 910 allows the encoder to identify the location of the image part 216 (
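A minimal sketch of how an alpha mask could drive quantization per macroblock. It assumes 16x16 macroblocks, a 0/1 alpha mask, and hypothetical quantization parameter (QP) values, where a lower QP means finer quantization and therefore more bits. How a QP map is handed to a particular encoder is encoder-specific and not shown. The `image_has_priority` flag covers the reversed case in which the background or content region is more important.

```python
from typing import List


def macroblock_qp_map(alpha: List[List[int]], mb_size: int = 16,
                      qp_fine: int = 24, qp_coarse: int = 40,
                      image_has_priority: bool = True) -> List[List[int]]:
    """Assign a QP to each macroblock from a 0/1 alpha mask.

    A macroblock containing any pixel of the prioritized part (the image
    part by default, otherwise the background/content part) gets qp_fine;
    all others get qp_coarse. QP values are illustrative only.
    """
    if not alpha or not alpha[0]:
        raise ValueError("alpha mask must be non-empty")
    rows, cols = len(alpha), len(alpha[0])
    prioritized = 1 if image_has_priority else 0
    qp_map = []
    for top in range(0, rows, mb_size):
        qp_row = []
        for left in range(0, cols, mb_size):
            important = any(
                alpha[r][c] == prioritized
                for r in range(top, min(top + mb_size, rows))
                for c in range(left, min(left + mb_size, cols))
            )
            qp_row.append(qp_fine if important else qp_coarse)
        qp_map.append(qp_row)
    return qp_map
```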
Efficiencies are gained in compression by addressing the different requirements of the content. When content is shared, the changes in content that accompany a change in video frame are typically small. In such cases, the Z-encoder may compress only those changes in the content, following the method 900 described above with respect to video frame 920. In an additional embodiment, should it be determined that the background or content portion of video frame 920 is actually more important than the user-extracted image, then the alpha mask 910 from the user-extraction process may be used to drive the quality of a quantization block 950 so that macro blocks in the background or content region of a video frame 920 are quantized with more bits than the user-extracted image, using the method described. And, in general, method 900 does not require that video frame 920 have gone through the chroma-keying process. Furthermore, in an embodiment, alpha mask 910 may be used to drive the quality of a quantization block 950 with the information from alpha mask 910 added through optional path 980 to prediction block 930.
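Compressing only the changes in shared content presupposes locating them. A minimal sketch, assuming single-channel frames stored as lists of rows and a hypothetical 16-pixel block size, that returns the blocks whose pixels differ between consecutive frames; only those blocks would then need re-encoding.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[Sequence[int]]


def changed_blocks(prev: Frame, curr: Frame, block: int = 16) -> List[Tuple[int, int]]:
    """Return the (top, left) corner of every block that differs between two frames."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same dimensions")
    rows = len(curr)
    cols = len(curr[0]) if rows else 0
    out = []
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            if any(prev[r][c] != curr[r][c]
                   for r in range(top, min(top + block, rows))
                   for c in range(left, min(left + block, cols))):
                out.append((top, left))
    return out
```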
After dechroma-keying 416, the frame of decoded video 1010 and the generated alpha mask 1020 are sent to the alpha blending block 418 (
$$C_{blended} = \alpha \, C_{video} + (1 - \alpha) \, C_{content}$$
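The blending formula can be applied per pixel and per channel, as in the minimal sketch below. It assumes the decoded video has already been positioned and scaled onto the content area (using the rendering information in the control information), so that video, content, and alpha mask share the same dimensions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def alpha_blend(video: List[List[Pixel]], content: List[List[Pixel]],
                alpha: List[List[float]]) -> List[List[Pixel]]:
    """Per pixel and channel: C_blended = alpha * C_video + (1 - alpha) * C_content."""
    if not (len(video) == len(content) == len(alpha)):
        raise ValueError("inputs must have the same number of rows")
    blended = []
    for vrow, crow, arow in zip(video, content, alpha):
        if not (len(vrow) == len(crow) == len(arow)):
            raise ValueError("inputs must have the same number of columns")
        blended.append([
            tuple(round(a * cv + (1 - a) * cc) for cv, cc in zip(vpx, cpx))
            for vpx, cpx, a in zip(vrow, crow, arow)
        ])
    return blended
```

With a hard 0/1 mask this reduces to choosing either the video pixel or the content pixel; fractional alpha values give soft edges.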
The following contains sample JavaScript and HTML5 code for implementing aspects of the embodiments, such as streaming live video, initializing video and canvas sizes, and binding post-processing actions to video during streaming.
Creating a persona by extracting a user image from a video will now be described regarding
For example, a single frame from the foreground video comprising a persona 1120 representing the user presenter 1420 may be embedded into a text frame in a chat session.
As seen in
As such, the camera 1510 receives color information and depth information. The received color information may comprise information related to the color of each pixel of a video. In some embodiments, the color information is received from a Red-Green-Blue (RGB) sensor 1511. As such, the RGB sensor 1511 may capture the color pixel information in a scene of a captured video image. The camera 1510 may further comprise an infrared sensor 1512 and an infrared illuminator 1513. In some embodiments, the infrared illuminator 1513 may shine an infrared light through a lens of the camera 1510 onto a scene. As the scene is illuminated by the infrared light, the infrared light will bounce or reflect back to the camera 1510. The reflected infrared light is received by the infrared sensor 1512 and provides depth information for the scene in view of the camera 1510. As such, objects within the scene or view of the camera 1510 may be illuminated by infrared light from the infrared illuminator 1513. The infrared light will reflect off of objects within the scene or view of the camera 1510 and the reflected infrared light will be directed towards the camera 1510. The infrared sensor 1512 may receive the reflected infrared light and determine a depth or distance of the objects within the scene or view of the camera 1510 based on the reflected infrared light.
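As a simple illustration of how depth information can yield a 0/1 alpha mask, the sketch below marks pixels closer than a fixed threshold as the user. This fixed-threshold approach is a swapped-in simplification, not the persona extraction method referenced in this document. Depth in millimeters, the threshold value, and the treatment of missing readings (None or 0) as background are all assumptions.

```python
from typing import List, Optional


def alpha_from_depth(depth_mm: List[List[Optional[int]]],
                     max_user_depth_mm: int = 1500) -> List[List[int]]:
    """Return 1 for pixels nearer than the threshold (user), else 0 (background).

    Pixels without a valid depth reading (None or 0) are treated as background.
    """
    return [
        [1 if d and d < max_user_depth_mm else 0 for d in row]
        for row in depth_mm
    ]
```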
In some embodiments, the camera 1510 may further comprise a synchronization module 1514 to temporally synchronize the information from the RGB sensor 1511, infrared sensor 1512, and infrared illuminator 1513. The synchronization module 1514 may be hardware and/or software embedded into the camera 1510. In some embodiments, the camera 1510 may further comprise a 3D application programming interface (API) for providing an input-output (IO) structure and interface to communicate the color and depth information to a computer system 1520. The computer system 1520 may process the received color and depth information and may implement the systems and methods disclosed herein. In some embodiments, the computer system 1520 may display the foreground video embedded into the background feed onto a display screen 1530.
Any node of the network 1600 may comprise a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof capable of performing the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g. a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration, etc.).
In some embodiments, a node may comprise a machine in the form of a virtual machine (VM), a virtual server, a virtual client, a virtual desktop, a virtual volume, a network router, a network switch, a network bridge, a personal digital assistant (PDA), a cellular telephone, a web appliance, or any machine capable of executing a sequence of instructions that specify actions to be taken by that machine. Any node of the network may communicate cooperatively with another node on the network. In some embodiments, any node of the network may communicate cooperatively with every other node of the network. Further, any node or group of nodes on the network may comprise one or more computer systems (e.g. a client computer system, a server computer system) and/or may comprise one or more embedded computer systems, a massively parallel computer system, and/or a cloud computer system.
The computer system 1650 includes a processor 1608 (e.g. a processor core, a microprocessor, a computing device, etc.), a main memory 1610 and a static memory 1612, which communicate with each other via a bus 1614. The machine 1650 may further include a display unit 1616 that may comprise a touch-screen, or a liquid crystal display (LCD), or a light emitting diode (LED) display, or a cathode ray tube (CRT). As shown, the computer system 1650 also includes a human input/output (I/O) device 1618 (e.g. a keyboard, an alphanumeric keypad, etc.), a pointing device 1620 (e.g. a mouse, a touch screen, etc.), a drive unit 1622 (e.g. a disk drive unit, a CD/DVD drive, a tangible computer readable removable media drive, an SSD storage device, etc.), a signal generation device 1628 (e.g. a speaker, an audio output, etc.), and a network interface device 1630 (e.g. an Ethernet interface, a wired network interface, a wireless network interface, a propagated signal interface, etc.).
The drive unit 1622 includes a machine-readable medium 1624 on which is stored a set of instructions (i.e. software, firmware, middleware, etc.) 1626 embodying any one, or all, of the methodologies described above. The set of instructions 1626 is also shown to reside, completely or at least partially, within the main memory 1610 and/or within the processor 1608. The set of instructions 1626 may further be transmitted or received via the network interface device 1630 over the network bus 1614.
It is to be understood that embodiments may be used as, or to support, a set of instructions executed upon some form of processing core (such as the CPU of a computer) or otherwise implemented or realized upon or within a machine- or computer-readable medium. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g. a computer). For example, a machine-readable medium includes read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical or acoustical or any other type of media suitable for storing information.
Although the present embodiment has been described in terms of specific exemplary embodiments, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention. The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
This application is a Continuation of U.S. application Ser. No. 14/145,151, filed Dec. 31, 2013, entitled “Transmitting Video and Sharing Content via a Network,” naming Quang Nguyen, which is incorporated herein in its entirety for all purposes.
Number | Date | Country |
---|---|---|
20160314369 A1 | Oct 2016 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14145151 | Dec 2013 | US |
Child | 15202478 | | US |